ABSTRACT

Continuous speech recognition assigns predetermined words to syntactic categories and defines the syntactic categories which can follow and precede each predetermined word. The recognition process is achieved by comparing the input sequence of speech signals to reference values and summing those which are syntactically permissible until they form a valid word. Speech values which follow a previously calculated valid word are compared to reference values listed in syntactic categories which can follow that word. For each word, values are updated indicating the current word's sequence number, syntax category, cumulative comparison sum, and the current list of compared words. Values are also stored for each word which identify the previous word, the following word and their syntax categories. This process is repeated until all input values have been processed. The results are then checked to verify valid syntax and the words with the closest match are read out.

13 Claims, 3 Drawing Sheets 
PROCESS FOR THE RECOGNITION OF A CONTINUOUS FLOW OF SPOKEN WORDS

The invention relates to a process for the recognition of a speech signal derived from a continuous flow of spoken words. The speech signal consists of a temporal sequence of speech values, each of which specifies a section of the speech signal. The speech values are compared with predetermined stored reference values. A group of the reference values in each case represents one word of a predetermined vocabulary. The comparison results are added up over various sequences of combinations of reference values and speech values per sequence. Only such sequences of words are taken into account whose order is permissible in accordance with a predetermined stored first list containing, for predetermined syntactic categories, at least one assignment per category to a combination of further syntactic categories and/or words.

A process of this kind is known from "Proc. ICASSP IEEE Conf. ASSP", Dallas, Apr. 1987, p. 69-72. In this known process, the first list is divided into two sublists, which on the one hand specify the assignment between syntactic categories and words and on the other hand specify the assignment of these categories to two other, subordinate categories where appropriate. Both lists are used for each new speech signal, in that retrospective observation is used each time to determine which category explains the preceding speech section best. At the end of the speech signal, that sequence of words can be traced back which resulted in the smallest total sum of all comparison results and which moreover corresponds to the grammar provided by the two lists. However, as a result of the retrospective observation for each new speech signal, it may occur that a sequence is not directly traced back to the beginning and consequently the process eventually recognizes two or even more sub-sequences within the speech signal which are in each case grammatically correct within themselves but which do not match each other grammatically.

The object of the invention is therefore to state a process of the type mentioned at the beginning which functions more reliably and makes fewer demands of the form, that is to say the assignments of the first list, so that even more than two further syntactic categories and/or words can be assigned to a syntactic category.

This object is achieved according to the invention, in that a second list is used containing at least references to the reference values of all those words which are compared with the respective next speech value as well as a sequence number per word, and in that in the course of the process, a third list is generated which contains, for each speech value which has been compared with the last reference value of at least one word, a group having in each case a plurality of entries, each entry containing in addition to a current sequence number

(a) a reference to a syntactic category of the first list,

(b) a first specification for a sequence of compared words and/or syntactic categories which are assigned to sequences of already compared speech values,

(c) a second specification for a sequence of words and/or syntactic categories which can be assigned to subsequent speech values on the basis of the first list,

(d) a further sequence number assigned to the respective entry,

(e) a first evaluation value,

(f) a second evaluation value and

(g) a sequence of compared words,

in that at least after every comparison of a new speech value with the last reference value of at least one word, a new sequence number is determined and after each such comparison the group of entries of the third list associated with the sequence number stored in the second list at this word is searched through for entries in which the sequence contained in the second specification begins with the compared word and, for each such entry present, a new entry is derived for the new group of the third list associated with the new sequence number,

in that subsequently, for each new entry where the abbreviated sequence contained in the second specification begins with a syntactic category for which at least one assignment is present in the first list, first further entries are made for the new group and, furthermore, a second further entry for the new group is derived for each of the new and first further entries of the new group in which the second specification contains an empty sequence,

in that deriving and making first and second further entries is repeated until, after at least one further entry, no second further entry occurs,

in that subsequently, for all entries of the new group where the second sequence begins with a word to be recognized, a reference to the reference data of this word is entered into the second list,

in that subsequently the next speech value is compared with the reference values of all words contained in the second list, and

in that this course of process steps is repeated until the last speech value of the speech signal to be recognized, after the processing of which the last group of the third list is checked for all entries containing a reference to the syntactic initial category and, as a second specification, an empty sequence and, as a further sequence number, that of the first group, and, from these entries, the sequence of compared words is read out and output from that entry having the smallest first evaluation value.

The process according to the invention combines the advantages of a retrospectively directed hypothesization of grammatically correct continuations for the already processed part of the speech signal with a continuous verification of these hypotheses starting from the beginning. Moreover, the process according to the invention has the result that only those words, or the associated reference values, are compared with the input speech values that are permissible on the basis of the grammar defined in the first list. Sampling values of the speech signal which were obtained at 10 ms intervals and reduced to their spectral values can be used as speech values. Other measures for preparing the speech sampling signals may also be used; similarly the speech values can be obtained from a plurality of sampling values and represent, for example, diphones or phonemes or even larger units, which makes no essential difference to the process according to the invention.

The process according to the invention has certain formal similarities with a process described in "Comm. of the ACM", Vol. 13, No. 2, Feb. 1970, p. 94-102. This process is, however, used to break down sentences in a written and hence unambiguous form into their grammatical components. This, however, presents problems in the automatic recognition of a continuous flow of speech, in that the manner of speaking and speed of speaking vary and the transitions between the words are fluid. The words within a sentence cannot therefore be determined reliably, but only with a certain degree of probability. During recognition, therefore, many different hypotheses for words or sequences of words must be considered. For an automatic recognition, then, that word sequence is to be determined which was spoken with the greatest probability.

It follows from the unreliability of the input data that for a larger vocabulary, for example of several hundred words, each new speech value input must be compared with a very large number of different reference values obtained from the combination of all words, the words occurring, due to the impossibility of exactly determining the word boundaries, with different lengths with respect to the reference words. When the restrictions which the grammar imposes on the combination possibilities of words are taken into account, the number of the words and hence of the reference values with which each new speech value must be compared can be restricted.

The speech values are expediently compared with the reference values according to the method of dynamic programming, which is known, for example, from German Offenlegungsschrift 3,215,868. It permits adaptation to a varying speed of speaking and the finding of the most probable word.

The process according to the invention is essentially defined by the use of the second list and the special structure of the third list. A particularly favourable order, particularly also with respect to the processing time, is obtained according to a development of the invention,

in that before the comparison of the first speech value, the first group contains in first entries in each case a reference to a syntactic initial category, an empty sequence as a first specification, in each case another of the combinations assigned to the initial category as a second specification, and an initial value for the two evaluation values, as well as, in further entries, all the categories which can be derived from the combinations of the second specification, with the corresponding combination in each case,

in that each new entry contains in the first specification a sequence extended to include the compared word, a sequence abbreviated to exclude the compared word in the second specification, the evaluation value incremented by the sum of the comparison results of the word as a first evaluation value, the sequence extended to include the compared word as a sequence of compared words, and furthermore the values of the entry present,

in that the first further entries contain in each case a reference to the syntactic category of the new entry from which this first further entry is derived, an empty sequence in the first specification, in each case another combination assigned to the syntactic category as a sequence in the second specification, the new sequence number as a further sequence number, the evaluation value of the new entry for both evaluation values, and an empty sequence as a sequence of compared words,

in that for each second further entry, from that group specified in the further sequence number of the relevant new or first further entry, that earlier entry is read out where the sequence associated with the second specification begins with the syntactic category to which this new or first further entry of the new group contains a reference, wherein the second entry contains the reference to the syntactic category of the earlier entry, a sequence extended to include the syntactic category of the current new or first further entry in the first specification, a sequence abbreviated to exclude the same syntactic category in the second specification, the sequence number of the earlier entry as a further sequence number, the sum of the first evaluation value of the earlier entry and the difference between the two evaluation values of the current entry as a first evaluation value, the corresponding evaluation value of the earlier entry as a second evaluation value, and the sequence of compared words of the earlier entry extended to include the sequence of compared words of the entry of the current group as a sequence of compared words, and

in that each reference to the reference data, together with the associated first evaluation value and the sequence number of the relevant entry, is entered into the second list. In this way, the evaluations are taken into account particularly well when completing recognized word sequences and when hypothesizing about correct continuations.

In the process according to the invention, it is possible that in the course of recognition of a sentence, two or even more hypotheses converge at a common point, that is to say two or more hypotheses result in the same grammatical continuation, with the evaluation of the converging hypotheses being different in general. In this case, it is expedient, according to a further development of the invention, that each first and second further entry is only made provided that no entry is present in the new group which contains the same reference, the same first and second specification and the same further sequence number and in which the first evaluation value is smaller than the first evaluation value of the intended further entry and, in the case of such an entry being already present but with a greater evaluation value, the latter is deleted. As a result of this, converging hypotheses are not separately traced further, but only the best one is taken into consideration further, since the other hypotheses cannot obtain a better overall evaluation at the end of the sentence to be recognized due to the same continuation. The conditions checked here ensure that really only those hypotheses are combined which result in exactly the same continuation. This represents, therefore, a recombination of hypotheses on the grammatical level.

It is of course possible that, in a group of entries in the third list where there is a plurality of entries, the second sequence begins with the same word or the same word class. In this case it is expedient according to a further development of the invention that a reference to reference data is only entered into the second list provided that no reference to the reference data of the same word and the same sequence number with a smaller evaluation value is already present in said list, and, in the case of such an entry being already present but with a greater evaluation value, the latter is deleted. A word or a word class need only be transferred from a group into the second list once, since after the complete comparison of this word, all entries of the relevant group are searched through to find whether the second sequence begins with this word therein. When the word from the entry with the smallest evaluation value is transferred into the second list, this entry specifies the best possible
evaluation of a word sequence which has led to the word transferred into the second list. On the other hand, the same word is transferred again and again into the second list from various groups.

In the case of a vocabulary range necessary for realistic application, however, the number of words to be simultaneously compared, reduced by taking account of the grammar, is still too large, so that to speed up the process, the search must be concentrated on the most promising hypotheses. This takes place according to a development of the invention in that each new and each first and second further entry is only made if its first evaluation value is smaller than a threshold value which is equal to the smallest evaluation value, extended by a constant, of all entries currently contained in the second list. Only such hypotheses are traced further, therefore, whose evaluation value deviates from the best evaluation value for a hypothesis only by the constant. The time required for the recognition of a sentence can be vastly reduced by this "cutting off" of less promising hypotheses.

This "cutting off" of unfavourable hypotheses on the grammatical level can also be correspondingly applied directly to the comparison of the words, preferably in addition. For this, according to a further development of the invention, every entry in the second list whose first evaluation value is greater than the threshold value is deleted. In this manner, the extent of the second list is normally continually reduced during the comparison with the successive incoming speech values, while on the other hand it is continually extended when tracing the more favourable hypotheses from the third list. In this manner, the extent of the second list remains restricted, so that the actual comparison procedure is executed quickly. A uniform evaluation for both cases is achieved by the use of the first evaluation value for the threshold value.

This threshold value can, however, only be determined when a new speech value has been compared with the reference values of all the words contained in the second list, so that a further pass is necessary for deleting entries in the second list in each case. This further pass can be dispensed with, according to a further development of the invention, in that the evaluation value of the preceding speech value is used for the determination of the threshold value as a smallest first evaluation value of the entries of the second list. However, since as a result of this a slightly smaller minimum first evaluation value results, the constant is increased slightly so that essentially the same threshold value is obtained here again as would result from the use of the last entries in the second list. The value of the constant is in any case dependent on the demands made of the speech system. If this constant is selected to be large, many hypotheses will be traced, so that the time required for the entire recognition of a sentence increases, whereas with a smaller value for the constant, the correct hypotheses may be lost in some very unfavourably spoken sentences, so that correct recognition can no longer occur.

The words or entries in the second list are deleted in particular when the actual word corresponding to the current speech values has little resemblance to a word contained in the second list. Since the last reference value of the last or a very similar word is, however, compared not only once but with several successive speech values, since for example the word ends can be spoken with a drawl, such a word would remain a relatively long time in the second list. Since a drawling of this kind is assumed to be only limited, however, it is favourable to delete such entries corresponding to completely recognized words as soon as possible. This can occur according to a further development of the invention in that a word is deleted from the second list if the number of sequence numbers lying between the new sequence number and the sequence number stored at the word is greater than a limit value contained in the reference values for this word. In this manner, drawled ends of words can still be captured while, on the other hand, such words do not greatly impede or slow down the recognition procedure.

An entry into the second list is made from corresponding entries of the third list, namely where the second sequence begins with a word to be recognized. Particularly with an extensive vocabulary, a hypothesis can be continued with a multitude of various words all belonging to the same grammatical word class, for example, all verbs of the vocabulary. An entry is then also present in the third list for each individual word, so that the latter becomes very extensive. In order, in particular, to reduce the extent of the third list, it is expedient according to a further development of the invention that the first list contains assignments of predetermined syntactic categories to other syntactic categories and/or word classes instead of words and in that, for all entries of the new group where the second sequence begins with a word class, a reference to this word class instead of to reference data of a word is entered into the second list. All words to be entered into the second list or which must be compared with the next incoming speech values are then determined unambiguously and directly from the word class. The third list then contains only a single entry in each case for all words of this kind.

The extent of the second list can also be reduced in a similar manner, namely when a reference is not entered separately to each of the words to be recognized, but likewise only to the word class, into the second list. In this case it is expedient that, with each reference in the second list, an auxiliary list is called up containing for each word class the references to the reference data of the words belonging to this class, and these references call up the corresponding reference values from the further list. The auxiliary list may have a very simple structure, since it only contains the assignment of a word class to the individual associated words.

For the realization of the recognition of speech signals, devices are known having a transducer for converting a spoken sentence into an electrical speech signal and for forming speech values, having a first memory containing specifications on syntactic categories of natural language and their assignment to further syntactic categories and/or specifications for words or word classes, having a further memory for reference values formed analogously to the speech values from sentences spoken earlier, and having a comparison device connected to an output of the transducer and to a data output of the further memory to supply comparison results from the comparison of speech values with reference values. The development of such an arrangement for carrying out the process according to the invention is characterized in that a second memory which stores the entries for the second list and a third memory which stores the entries for the third list are provided, the contents of the second memory specifying at least a part of the addresses of the further memory, and in that a
controller is present which is set up such that it addresses the first, the second and the third memory and records data in the second and the third memory and reads out of these as well as out of the first memory and, on receiving a word end signal for at least one word, forms the new, the first and the second further entries for the third memory and subsequently the entries for the second memory and records in these and, after processing of the last speech signal, outputs the complete word string contained in the third memory with the smallest evaluation value to an output device. In this arrangement, it is particularly expedient that the controller is a processor, in particular a programmed microprocessor.

Inasmuch as the third list contains references to word classes instead of the words themselves, it is moreover expedient that the output of the second memory is coupled to the address input of an auxiliary memory, the [...] further memory.

Exemplary embodiments of the invention will be explained below in more detail with reference to the drawing.

FIG. 1 shows a schematic overview of the two-step nature of the process according to the invention,

FIG. 2 shows a flow chart to clarify the process,

FIG. 3 shows the basic structure of the entries in the third list,

FIG. 4 shows the determining of the sequence of words or syntactic categories from an input sequence of speech values.

FIG. 5 shows a schematic block circuit diagram for carrying out the process according to the invention.

FIG. 1 shows a schematic diagram which clearly illustrates the two sections into which the process can be divided. The actual comparison of speech values, supplied via input 12, with reference values takes place in section 1. The comparison results for completely compared words are supplied via path 8 to section 2, where, on the one hand, these words are stored and, on the other hand, the comparison results of these in each case completely recognized words, together with the grammatical rules for this, are used to determine new words which can follow next on the basis of these grammatical rules, and these new words are supplied to section 1 via path 11, in order to determine or to complete the reference values with which the following speech values, supplied via input 12, are to be compared. In this manner, the process alternates continually between section 1 and 2, until the last speech value is compared. Subsequently, via output 14, the word sequence is output which has shown the best matching with the sequence of speech values and which moreover corresponds to the stored grammatical rules.

The speech values supplied via input 12 may, for example, be short-time spectra obtained from the speech signal at 10 ms intervals; they may however also be already further processed values, for example phonemes or diphones. The determining of the speech values, supplied via input 12, from the speech signal is not illustrated in more detail here, since this is performed in a usual manner and the details of this are not important for the invention.

The reference values with which the speech values are compared are obtained in the same manner as the supplied speech values from speech signals of previously spoken sentences, and they are stored in a memory or memory area 4. The actual comparison is symbolized by the circle 7, and it may be carried out, for example, in a manner known from German Offenlegungsschrift 3,215,868. In particular, varying speeds of speaking can thus be taken into account.

The reference values are selected in accordance with a list, which is designated hereinafter as the second list and which is stored in a memory or memory area 3. This second list contains references to words in coded form which, as described above, have been determined in section 2 of the process and specify which word can represent the speech values coming in via input 12 on the basis of the grammatical rules. Each word is incidentally assigned to a temporal sequence of reference values, the temporal successive processing of which is not illustrated in more detail or is contained in the comparison operation 7. The comparison result from the comparison operation 7 is again stored in the second list 3. As will be explained in more detail later, several words are "active" in each case, that is to say the associated reference values are compared with each incoming speech value. The second list 3 contains further specifications in addition to the coded word. Since the words, however, are mostly of different lengths [...] and various words frequently begin at different times, the comparison of a word with the incoming speech values can be fully completed, that is to say the last reference value of this word is compared with a speech value, while the beginning or the middle area of other words is compared.

When a word has been completely compared, therefore, it is supplied together with the further, still to be explained, specifications via path 8, which represents a step in the process, to section 2 and is used there to update a list, hereinafter designated as the third list, which is stored in a memory or memory area 5. This third list is modified yet again by further process steps, indicated by the arrows 9 and 10, arrow 10 specifying the consideration of a list which is hereinafter designated as the first list and is stored in a memory or memory area 6. This list has a structure such that the grammatical rules of speech are taken into account when determining the grammatically permissible next words to be compared. Arrow 9 indicates changes in the third list in the form of additional entries which are only formed on the basis of entries present in the third list.

The order of the overall process will be explained in more detail below with reference to the flow chart schematically represented in FIG. 2. The structure of the third list in memory 5 is of importance for this, which is indicated more fully in FIG. 3. This third list is subdivided into a number of groups of entries, in which entries 31 and 32 are specified for the first group with the sequence number 0, entries 41 and 42 are specified for the next group with the sequence number k and entries 51 to 54 are specified for the last group with the sequence number n. Each entry is subdivided into a number of sections a to g, which form quasi-columns over the row-by-row entries in the list. The sequence numbers are specified only once for each group for the sake of clarity. The sequence number is, however, actually contained in each entry in a corresponding section. This third list in FIG. 3 is first generated or filled up during the execution of the processing with the successive incoming speech values, that is to say initially the list is not present at all or is empty. The number of entries per group of the third list is not defined here, but is obtained from
the course of the process on the basis of the incoming speech values.
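The organization of the third list into groups of entries with sections a to g can be pictured with a small data structure. The following is only a minimal sketch under stated assumptions: the field names (`category`, `done`, `todo`, `origin`, `score`, `base_score`, `words`), the helper `add_entry` and the sample category names are hypothetical labels chosen for illustration and do not appear in the description.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Entry:
    """One row of the third list; field names are illustrative assumptions."""
    category: str            # (a) reference to a syntactic category of the first list
    done: Tuple[str, ...]    # (b) first specification: words/categories already compared
    todo: Tuple[str, ...]    # (c) second specification: words/categories still expected
    origin: int              # (d) further sequence number
    score: float             # (e) first evaluation value
    base_score: float        # (f) second evaluation value
    words: Tuple[str, ...]   # (g) sequence of compared words


# The third list: one group of entries per sequence number.
# It starts out empty and is filled while speech values are processed.
third_list: Dict[int, List[Entry]] = {}


def add_entry(sequence_number: int, entry: Entry) -> None:
    """Append an entry to the group with the given sequence number."""
    third_list.setdefault(sequence_number, []).append(entry)


# Hypothetical first entry of group 0: nothing compared yet, everything expected.
add_entry(0, Entry("SENTENCE", (), ("SUBJECT", "VERB"), 0, 0.0, 0.0, ()))
```

Keying the groups by sequence number mirrors the remark that the sequence number is shown once per group in FIG. 3 but logically belongs to every entry.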

In FIG. 2, symbol 20 represents the general start symbol. In block 21, the group 0 of the entries in the third list in FIG. 3 is generated before the first speech value arrives, to be precise on the basis of the first list containing the assignment of syntactic categories of various orders to each other and to words. A syntactic category is in particular a part of a sentence, for example the object, which may consist of a subordinate clause, of a noun with or without adjectives, of a pronoun or of an empty part of a sentence, that is to say may not be present at all. One line in the first list is available for the syntactic category "object" for each of these possibilities. The individual possibilities here in turn represent at least partly syntactic categories for which there are likewise various possibilities in the grammar, that is to say several lines in the first list, etc.

The first list with the syntactic categories and their assignment to each other and to words is defined for the process and determines which sentences can be recognized with which grammatical structures. This list is therefore already available at the beginning of the process and is not subsequently changed.
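A first list of this kind can be pictured as a fixed table with one line per assignment. The sketch below is an illustration only: the category names, the particular assignments and the helpers `assignments` and `is_category` are assumptions made for the example, with the four lines for "OBJECT" following the possibilities named in the text (subordinate clause, noun with or without adjectives, pronoun, empty).

```python
from typing import Dict, List, Tuple

# Hypothetical first list: each syntactic category maps to its assignments,
# one tuple per line of the list. An empty tuple stands for an empty part
# of a sentence, that is, the category may be absent.
FIRST_LIST: Dict[str, List[Tuple[str, ...]]] = {
    "SENTENCE": [("SUBJECT", "VERB", "OBJECT")],
    "SUBJECT": [("PRONOUN",), ("NOUN_PHRASE",)],
    "OBJECT": [
        ("SUBORDINATE_CLAUSE",),
        ("NOUN_PHRASE",),
        ("PRONOUN",),
        (),
    ],
    "NOUN_PHRASE": [("NOUN",), ("ADJECTIVE", "NOUN")],
}


def assignments(category: str) -> List[Tuple[str, ...]]:
    """Return all lines of the first list for one syntactic category."""
    return FIRST_LIST.get(category, [])


def is_category(symbol: str) -> bool:
    """A symbol is a syntactic category if the first list has lines for it;
    otherwise it is treated as a word or word class."""
    return symbol in FIRST_LIST
```

Because the table is never modified during recognition, it can be built once and shared by every step that needs to look up the permissible continuations of a category.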

In every grammar, even in a very limited grammar
only for simple sentences, one syntactic category is,
however, always present, namely an initial category which
comprises each recognizable sentence and for which the
first list contains one or in general a plurality of
assignments to other syntactic categories which represent
virtually the most basic division of a sentence. These
most basic syntactic categories in turn contain further
assignments in the first list, etc., until finally each
assignment chain results in words to be recognized. An
advantageous intermediate solution consists incidentally in
that the strings of assignments are not allowed to end in
the words to be recognized themselves, but in word
classes, such as nouns, verbs, etc., for example. This
reduces the extent of the first list, and the third list, and
possibly also the second list, quite substantially. This
will become even clearer during further explanation of
the process. Only at transition 11 from the second
process step to the first, that is to say when the next words
to be recognized are transferred, is the word class in
question broken down into the individual words. In
doing so, either each word of the word class in question
is entered into the second list, or only the word class
itself is entered into the second list, and from this word
class the reference values of the associated individual
words are called up, for example via an auxiliary
memory.
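
The first list described above can be pictured as a small table of grammar lines. The following is only a minimal sketch under stated assumptions: the category names, word classes and words (S, SUBJECT, NOUN, "dog" and so on) are hypothetical and do not come from the patent, and the dictionary stands in for the auxiliary memory that maps a word class to its individual words.

```python
# Hypothetical toy "first list": each line assigns one sequence to a category.
# All names below are illustrative assumptions, not taken from the patent.
FIRST_LIST = (
    ("S", ("SUBJECT", "VERB", "OBJECT")),   # S plays the role of the initial category
    ("SUBJECT", ("NOUN",)),
    ("SUBJECT", ("PRONOUN",)),
    ("OBJECT", ("ADJECTIVE", "NOUN")),
    ("OBJECT", ("NOUN",)),
    ("OBJECT", ("PRONOUN",)),
    ("OBJECT", ()),                          # the object may not be present at all
)

# Hypothetical auxiliary lookup from a word class to its individual words.
WORD_CLASSES = {
    "NOUN": ("dog", "cat"),
    "PRONOUN": ("she", "it"),
    "VERB": ("sees",),
    "ADJECTIVE": ("small",),
}


def assignments(category):
    """Return every sequence the first list assigns to the given category."""
    return [seq for cat, seq in FIRST_LIST if cat == category]


def is_category(symbol):
    """A symbol is a syntactic category if at least one line is assigned to it."""
    return any(cat == symbol for cat, _ in FIRST_LIST)


def words_of(word_class):
    """Break a word class down into its individual words."""
    return WORD_CLASSES[word_class]
```

Ending the assignment chains in word classes keeps the table short: adding a new noun only touches the auxiliary lookup, not the grammar lines.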

In the first processing block 21 of the flow chart in
FIG. 2, the entries for the group 0 are thus generated in
the third list of FIG. 3. Any number may be selected as
the sequence number of the individual groups; however,
it is expedient to select this to begin in natural
numeric order and continue upwards, so that here the
sequence number 0 is selected.

The individual sections of each entry of the third list
thus have the following meaning.

(a) This is a reference to a syntactic category of the
first list, that is to say expediently the address of that
line of the first list in which a syntactic category occurs
for the first time. Since all assignments to a syntactic
category are frequently required one after the other,
these are arranged successively in the first list, so that
counting can be continued from the first line for the
assignments of in each case one syntactic category.

(b) This section contains a specification for a
sequence of already compared words and/or syntactic
categories, wherein this sequence has moreover already
been checked to see that it matches the grammatical
rules according to the first list. This is automatically
ensured in that the sequence specified in section b
represents a part of the assignment to the syntactic category
specified in section a.

(c) This section contains a specification for a se- 
quence of words and/or syntactic categories which 
represents the rest of the assignment of the syntactic 
category in section a and hence specifies a sequence of 
possible words or categories still to be expected in the 
following speech signal. 

(d) This section contains a further sequence number 
of a group, to be precise either of the current group or 
an earlier group, depending on how the entry in ques- 
tion of the third list is formed. This will become clearer 
from the further description of the process. 

(e) This section contains a cumulative evaluation
value which specifies the total sum of all comparison
sums between all previously arrived speech values and 
the total sequence of reference values belonging to the 
hypothesis of which the syntactic category specified in 
section a represents at least a part. 

(f) This section contains an evaluation value which 
had been attained as the syntactic category was begun 
according to section a. 

(g) This section contains a sequence of compared
words corresponding to the sequence specified in
section b. Since section b may, however, specify in part or
completely only syntactic categories, each of which
combines a plurality of words or, to be more precise, a
plurality of word types, the individual recognized words
must be recorded separately in the correct sequence, which
takes place in this section.

First of all, the assignments to the initial category in 
the first list are now entered in the entries 31, 32 etc. in 
the group 0. Here, therefore, section a contains the 
reference to this initial category, for example to the first 
line in the first list, section b contains an empty se- 
quence, since no comparisons have been made yet, 
while section c contains the further syntactic categories 
assigned to the initial category. Section d contains the 
sequence number of the first group, i.e. 0. Sections e and 
f contain an initial value, expediently the value 0.
Likewise, section g does not yet contain anything, or just an
empty sequence.
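
The sections (a) to (g) and the initial entries of group 0 can be sketched as a small record type. This is an illustration under stated assumptions, not the patent's storage layout: the field names, the use of tuples and the example grammar lines are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Entry:
    """One entry of the third list; field names are illustrative assumptions."""
    category: str                # (a) reference to a syntactic category
    compared: Tuple[str, ...]    # (b) already compared words and/or categories
    expected: Tuple[str, ...]    # (c) rest of the assignment still to be expected
    start_group: int             # (d) further sequence number of a group
    total_score: float           # (e) cumulative evaluation value
    start_score: float           # (f) evaluation value when the category was begun
    words: Tuple[str, ...]       # (g) individual recognized words, in order


def initial_group(initial_assignments, initial_category="S"):
    """Build the entries of group 0 from the lines assigned to the initial category."""
    return [
        Entry(initial_category, (), tuple(seq), 0, 0.0, 0.0, ())
        for seq in initial_assignments
    ]


# Hypothetical example: two lines of the first list assigned to the initial category.
group0 = initial_group([("SUBJECT", "VERB", "OBJECT"), ("VERB", "OBJECT")])
```

Each entry starts with empty sections b and g, the full assignment in section c, sequence number 0 in section d and the value 0 in sections e and f, as described above.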

When all assignments to the initial category have been
processed into entries in this manner, the first syntactic
category of section c is checked for each entry, to see
whether an assignment to further categories or words is
present for it, and for each such assignment a further
entry is made in group 0, where section a of the new entry
contains the first category of section c of the old entry,
section c contains the assigned sequence of syntactic
categories or words or a mixture of these, while sections
b and d to g contain the same values as for the first
entry. Section c is likewise checked again in these
extended entries to see whether it contains a sequence
beginning with a syntactic category, etc., until finally
the only entries being added are ones in which each
sequence specified in section c begins with a word to be
recognized, namely with one of the first possible words of
all recognizable sentences. Since the words at the
beginning of a sentence, however, all
belong to one or a few different classes of word due to
the grammar, just as the possible words at the further
positions within the sentence, it is expedient only to
specify the word classes in sections b and c in the entries
in the third list. This reduces the extent of the third list
quite substantially, as is readily apparent.

No further assignment is now present for this first
word or word class, but this first word or word class in
section c of the entries in question must be transferred
into the second list, so that the first speech value can be
compared with the reference values of the permissible
first words of a sentence. Furthermore, the first
evaluation value contained in section e, that is to say usually 0,
as well as the associated sequence number of the group,
that is to say in this case likewise the value 0, is
transferred into this second list. This is the initial status of the
second list, which, however, is frequently changed
during the course of the process.

After the second list has been created in this manner
with the words to be compared and the third list has
been created with the first group, the first speech value
can be compared with reference values in block 22 in
FIG. 2. The comparison result represents a measure of
the matching or a distance value between the first
speech value and the corresponding reference values,
and this applies analogously also for the following
speech values. Each new distance value determined is
added in a manner known per se to the minimum
comparison sum determined up until then. This occurs
during the progressive course of the speech signal with a
plurality of temporally sequential reference values of
the same word in accordance with the principle of
dynamic programming, as is described, for example, in the
already mentioned German Offenlegungsschrift
3,215,868, in order in particular to be able to
compensate for varying speeds of speaking, until finally the last
reference value of a word has been compared, or more
precisely, has been compared for the first time, since the
next following speech value(s) is (are) likewise still
compared with the last reference value. Incidentally,
the sequence number is increased with each new speech
value, but a new group with entries for a sequence
number is only made when the associated speech value
has been compared as mentioned with the last reference
value of a word. Alternatively, the sequence number can
be updated only at each word end of this kind.

This takes place in block 23 in FIG. 2. In doing so, a
new entry is made into the third list from such a
completely compared word as well as from the sequence
number stored at this word or the associated word class
in the second list and from the evaluation value stored
there, which is increased by the comparison sum of this
word. For this purpose, in that group of the third list
whose sequence number is stored at the completely
compared word or the associated word class in the second
list, at least section c of each entry is read out and it is
checked whether the sequence contained therein begins
with the word or the associated word class which has
just been completely compared. At least one such entry
must be present in any case, since it is only from such an
entry of the third list that the recognized word or the
associated word class in the second list can have come.
For each such entry, for the group with the current
sequence number in the third list, a new entry is
generated which contains in section a the same reference to
the syntactic category as the entry read out, in section b
the sequence extended to include the completely
compared word or the associated word class, in section c the
correspondingly abbreviated sequence, in section d the
same sequence number as in section d of the entry read
out and in section e an evaluation value equal to the sum
of the first evaluation value from the entry read out and
the comparison sum attained for the word in question,
while the second evaluation value of the entry read out
is transferred, and in section g the sequence contained
therein is extended to include the completely compared
word, specifically only the compared word itself, not
the word class. At the first completely compared
word after the beginning, some of these values of the
new entry seem rather trivial; however, in the further
course of the process, the derivation of the individual
sections of each new entry has great significance.

Before a new entry of this kind is entered into the
current group of the third list, it is however still
checked whether the evaluation value determined for
section e is smaller than a threshold value. This
threshold value is determined from the evaluation
values of all entries currently contained in the second list,
in that the smallest of these evaluation values is
searched for and is increased by a fixed constant. As
already mentioned earlier, for each word or for each
word class in the second list, the smallest evaluation
value in each case, that is to say the best possible
evaluation for a sequence of words, is recorded which led to
the word or the word class in question. As the hypotheses
currently active via a corresponding entry in the third
list are continued, it is possible that a hypothesis which
does not currently have the most favourable evaluation
value turns out, in its further continuation, to be more
favourable than other hypotheses, that is to say exhibits
a smaller increase of the evaluation value; it is assumed,
however, that this improvement is only limited. This
limited improvement is taken into account by the constant
when the threshold value is determined. If, therefore, a
new entry has a first evaluation value greater than this
threshold value, it is assumed that this hypothesis can no
longer lead to a word sequence with the smallest overall
evaluation even in the case of a favourable continuation.
The value of the constant thus represents a compromise,
since if this is selected to be too small, it may result in
the most favourable word sequence being lost, because it
is temporarily less favourable than others, while too
large a constant leads to too many hypotheses being
further traced, which increases the processing required
for the recognition of the speech signal considerably.
Moreover, in the latter case, the possibility of an
incorrect recognition is increased. Overall, however, even
with a relatively high value of the constant, that is to
say with a high threshold, a substantial reduction of the
processing time is still achieved in contrast to the case
when no threshold is taken into consideration.

Incidentally, the same threshold is also used in the
comparison of the speech values with the reference
values specified by the entries of the second list. For the
majority of hypotheses, continuation on the basis of the
grammar is possible with various words, which becomes
particularly clear if word classes instead of words are
entered in the second list. In most cases these possible
words are substantially different, so that for many of
these words it is already evident after a comparison of
only part of the reference data of the word that it does
not match the word contained in the speech signal
sufficiently. This is evident from the fact that for such
words the sum of the associated evaluation value from the
second list and the comparison
results rises sharply. If such words are no longer taken
into consideration as soon as this sum exceeds the
threshold value, in that this word is then deleted in the
second memory or, if word classes are stored in the
second memory, the assignment of these badly matching
words to the word class in question is no longer
taken into consideration, then considerable processing
time can likewise be saved.

A further reason for deleting a word in the second list
or no longer taking the assignment of a word to a word
class in the second memory into consideration arises
from the fact that, as mentioned earlier, several
successive speech values are compared with the last reference
value of a word in order to take a drawled pronunciation
at the end of a word into account. If the word in
question in the speech signal is not spoken with a drawl,
the comparison sum obtained when comparing the last
reference value with further successive speech values
rapidly becomes larger, so that the sum of evaluation
value and comparison sum exceeds the threshold value.
The number of successive speech values compared with the
last reference value can nevertheless be reduced further
by making only a given, if appropriate word-dependent,
number of successive comparisons and subsequently
deleting the word in question from the second memory.
This can save further unnecessary processing time.

If the last reference value of several words has been
compared simultaneously, that is to say with a certain
speech value, this process step is performed separately
for each word. When in this manner the corresponding
new entries have been made in the third list for the
completely compared word or all completely compared
words, these individual new entries are checked to see
whether section c begins with a syntactic category. This
takes place in block 24 in the flow chart according to
FIG. 2. If an entry is then found where section c fulfils
this condition, first further entries are added to the
group with the current sequence number which contain in
section a the reference to the syntactic category of
section c of the entry tested, in section b an empty
sequence, in section c the sequence of further syntactic
categories and/or words or word classes assigned to
this category, in section d the current sequence number,
in sections e and f the first evaluation value from section
e of the entry tested and in section g an empty sequence.
Before this entry, or each first further entry, is actually
made, however, it is checked whether an entry is not
already present in the current group whose specifications
in sections a to d completely match the entry to be made.
If such an entry is indeed already present in the current
group, it is checked whether its first evaluation value is
greater than that of the entry to be made. If this is the
case, the present entry is replaced by the entry to be
made, and in the other case, the newly derived first
further entry is not made. The comparison of the
specifications in sections a to d ensures that in this manner
two hypotheses are recombined which, although they
have usually started from different points and have
taken different paths, should now find exactly
the same continuation. In this case, only the hypothesis
with the most favourable evaluation value up to now is
therefore further traced, since another hypothesis cannot
achieve a better evaluation value even at a later
stage. A comparison of the first evaluation value of each
of these first further entries with the threshold value is
no longer necessary, since these first further entries
cannot have a higher evaluation value than the entries
already present.

When this has been performed for all new entries in
question of the group with the current sequence number,
it is checked, in accordance with block 25 in the
flow chart according to FIG. 2, whether in this group
an entry is present in which section c contains an empty
sequence, where therefore the sequence assigned to that
category specified in section a has been completely
compared.

In the case of such an entry, in the group whose
sequence number is specified in this entry of the current
group in section d, that earlier entry is searched for in
which section c begins with that syntactic category
specified in section a of the entry in question of the
current group. A second further entry for the current
group is then derived from the earlier entry, which
entry contains in section a the syntactic category in
section a of the earlier entry, in section b the sequence
extended to include the syntactic category of the entry
in question of the current group, in section c the
correspondingly abbreviated sequence, in section d the
sequence number of the earlier entry, in section e the sum
of the first evaluation value of the earlier entry and the
difference of the two evaluation values of the entry in
question of the current group, in section f the
corresponding evaluation value of the earlier entry and in
section g the string of the word sequences from both
entries. In this manner, the hypotheses for older,
higher-ranking syntactic categories are continually verified.
Before this second further entry is actually made into
the third list, it can be checked analogously as described
above whether the first evaluation value obtained by
means of forming a sum has not become greater than the
threshold value.

All entries of the current group are processed in this
manner. In this process, however, entries can arise in which
section c contains an empty sequence, so that the
preceding process step is carried out again for all entries of
the current group, in which new categories are thus
entered, and the verification of the older hypotheses is
subsequently continued again etc., until no new
syntactic category has been entered any more. This is
indicated by the decision lozenge 26 in the flow chart in
FIG. 2, which therefore performs the loop through
blocks 24 and 25 until no new entry is made, when the
process then moves to block 27. In block 27, all entries
of the current group of the third list are run through
again, but it is now checked whether an entry is present
whose associated sequence in section c begins with a
word or a word class. If such an entry is found, this
word or word class is transferred into the second list
together with the first evaluation value of this entry and
the sequence number of the current group.

Before transferring into the second list, however, it is
checked whether an entry with the same word or the
same word class and the same sequence number is not
already present therein. If such an entry is found, it is
checked whether the entry present has a greater
evaluation value than the entry to be made. If this is the case,
the entry present is replaced by the entry to be made,
otherwise no new entry is made into the second list. In
this manner, the second list always only contains, for
each sequence number, the entry of a particular word with
the smallest evaluation value, which is of
significance for the abovementioned determining of the
threshold value.
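
The bookkeeping for the second list and the threshold derived from it can be sketched in a few lines. This is a minimal sketch under stated assumptions: the dictionary keyed by word (or word class) and sequence number, the function names and the numeric values are hypothetical, and the constant is an arbitrary example value.

```python
def transfer(second_list, word, score, seq_no):
    """Enter a word or word class into the second list.

    Only the smallest evaluation value is kept for each
    (word, sequence number) pair.
    """
    key = (word, seq_no)
    if key not in second_list or second_list[key] > score:
        second_list[key] = score


def threshold(second_list, constant):
    """Smallest evaluation value in the second list, increased by a fixed constant."""
    return min(second_list.values()) + constant


# Hypothetical values, chosen only to exercise the three cases.
second_list = {}
transfer(second_list, "NOUN", 12.0, 3)
transfer(second_list, "NOUN", 9.5, 3)    # smaller value replaces the entry present
transfer(second_list, "NOUN", 11.0, 3)   # greater value: no new entry is made
transfer(second_list, "VERB", 14.0, 3)
limit = threshold(second_list, constant=5.0)
```

A new third-list entry whose first evaluation value exceeds `limit` would then be dropped; a larger constant keeps more hypotheses alive at the cost of more processing.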
It is checked in lozenge 28 whether the last speech
value has been input. If this is not the case, the process
returns to block 22 and the next speech value is
processed in the manner described. If, however, the last
speech value has been reached, the process moves to
block 29, in which the last group of the third list is
searched for entries containing in section a the initial
category, in section c an empty sequence and in section
d the sequence number of the first group as a sign that
a grammatically complete sentence has been compared,
and if several such entries are present, that entry is used
which has the smallest first evaluation value in section
e, and from this entry the word sequence in section g is
read out which represents the sentence recognized with
the greatest probability. This completes the process.

FIG. 4 illustrates how new hypotheses are developed
and old hypotheses are concluded with the assignments
during the process. A sequence of syntactic categories,
of which one is denoted by A, is assigned to the initial
category, which is denoted here by S. This syntactic
category A breaks down in turn into a sequence of a
word w, a further syntactic category B and a further
word z. The sequence of words x and y is
assigned to the syntactic category B. The first speech
values a(1) and following are already assigned to the
initial category via a corresponding tree. The speech
values a(h) to a(k) have matched the word w best. The
following speech values a(k+1) to a(i-1) were in turn
the most similar to the word x. Analogously, the speech
values a(i) and following were most similar to the word
y, etc. After the comparison of both words x and y is
completed, the process described verifies the syntactic
category B, and hence the elements w and B forming
the first sub-sequence are verified for the syntactic
category A. As soon as the hypothesis for the word z is also
completed, category A is completely verified, and a
hypothesis is built up after the following category, not
shown in more detail in FIG. 4. In this manner, in the
process described, only hypotheses ever arise which are
guaranteed to be grammatically correct from the
beginning up to the current value, and this applies until the
end of the speech signal, that is to say until the last
speech value.

FIG. 5 illustrates schematically the block circuit
diagram of a device for carrying out the process described.
The acoustic speech signal is captured via the
microphone 62 and converted into an electrical signal, and
this is digitized in the device 64, and speech values are
formed therefrom, which may be, for example,
short-time spectra of the speech signal, LPC coefficients or
even phonemes. These speech values are supplied one
after the other as multibit data words via the connection
65 to the comparator 66.

The latter receives via the connection 67 reference
data words from a memory 68 which contains for each
word of a given vocabulary a sequence of reference
data which has been formed analogously to the data
words on the connection 65 in an earlier learning phase.
The individual words in the reference data memory 68
are addressed either via the connection 85 from the
memory 76 or via the connection 85b from an auxiliary
memory 80, which will both be explained later. The
individual reference data within the word are addressed
via the connection 63 from the comparator 66 or from a
controller 72, depending on in which manner known
per se the comparison is carried out in detail.

The comparison results formed by the comparator 66
are supplied to the controller 72 and processed there.
This controller 72, which may be a microprocessor, for
example, addresses via the connection 75 a first memory
70 containing the specifications of the first list on the
syntactic categories and their assignment to each other
and to words or word classes, and these are supplied via
the connection 73 to the controller 72. This forms, in
accordance with the process described, data for the third
list, for which a third memory 78 is provided, which is
addressed from the controller 72 via the connection 77
and which receives the data to be recorded via the
connection 79. Analogously, the data read out from the
third list are also supplied from the memory 78 via the
connection 79 to the controller 72, so that the
connection 79 is expediently of bidirectional nature.

The data formed from the third list, in accordance
with the process described, for the second list are
supplied by the controller 72 to the memory 76, which is
addressed via the connection 81, via the connection 83.
When entering, the next free location in the second list,
which may have become free due to the deletion of a
word, is then addressed, and for the processing of a new
speech value supplied via the connection 65 to the
comparator 66, all entries of the second list are
addressed one after the other in memory 76. Provided that
the second list contains in each entry a direct reference
to individual words, this reference is supplied via the
connection 85 to the reference data memory 68 as an
address. If the second list, however, contains references
to word classes instead of individual words, which is
more expedient in the case of an extensive vocabulary,
the reference in question is supplied via the connection
85a to the address input of an auxiliary memory 80 which
contains at each address location the addresses of the
reference data memory 68 for the individual words
belonging to the word class, and these addresses are
supplied one after the other via the connection 85b to the
reference data memory. Depending on the course of the
comparison of the previous sequence of speech values
with reference data, in each case the currently required
reference data within the respective word are addressed
via the connection 63. As soon as the end of the sequence
of reference data has been reached for a word, this is
reported to the controller 72 via the connection 69. This
then extends the third list in memory 78 and, if
appropriate, forms new entries for further words or word
classes to be compared in the second list in memory 76,
as set out in the process described above, and
subsequently the next speech value on the connection 65 is
again processed.

When the last speech value of the speech signal has
been input, which may be determined, for example, by
the recognition of a fairly long pause in speaking, the
word sequence with the best evaluation value is read
out from the third list in memory 78 and supplied via the
connection 71 to an output device 74, which may be a
display device or a printer or even a memory, for
example.

What is claimed is:

1. A process for the recognition of a speech signal
derived from a continuous flow of spoken words, which
speech signal comprises a temporal sequence of speech
values, each of which values specifies a section of the
speech signal; comprising:

comparing the speech values with predetermined
stored reference values, a group of which reference
values represents one word of a predetermined
vocabulary for forming an initial evaluation value;

summing the comparison results over various sequences
of combinations of reference values and speech
values per sequence whose order is permissible in
accordance with a predetermined stored first list
containing, for predetermined syntactic categories,
at least one assignment per category to a combination
of further syntactic categories and/or words
for forming a cumulative evaluation value;
generating a second list and a third list, the second list
including references to the reference values of all 
those words which are compared with the respec- 
tive next speech value as well as a sequence number 
per word, and the third list including, for each 
speech value which has been compared with the 
last reference value of at least one word, a plurality 
of entries, each entry including a current sequence 
number and: 

(a) a reference to a syntactic category of the first 
list, 

(b) a first specification for a sequence of compared
words and/or syntactic categories which are
assigned to sequences of already compared 
speech values, 

(c) a second specification for a sequence of words 
and/or syntactic categories which can be assigned
to subsequent speech values on the basis
of the first list, 

(d) a further sequence number assigned to the re- 
spective entry, 

(e) a first cumulative evaluation value,

(f) a second initial evaluation value and

(g) a sequence of compared words; 

determining a new sequence number at least after
every comparison of a new speech value with the
last reference value of at least one word, and after
each such comparison, searching through the
group of entries of the third list associated with the
sequence number stored in the second list at this
word for such entries in which the sequence
contained in the second specification begins with the
compared word, and deriving a new entry for each
such entry present, for the new group of the third
list associated with the new sequence number;

making a first further entry in the new group for each
new entry in which the abbreviated sequence
contained in the second specification begins with a
syntactic category, for which at least one
assignment is present in the first list, and deriving a
second further entry for the new group for each of the
new and first further entries of the new group for
which the second specification contains an empty
sequence;

repeating the steps of deriving and making the first
and second further entries alternately until, after at
least one first further entry, no second further entry
occurs;

entering a reference to the reference data of the first 
word of each entry of the new group in which the 
second sequence begins with a word to be recog- 
nized; 

comparing the next speech value with the reference 
values of all words contained in the second list; 

repeating the process steps until the last speech value
of the speech signal to be recognized has been
processed;

checking the last group of the third list for all
entries containing: a reference to the syntactic
initial category, an empty sequence, and a sequence
number; and

reading out the sequence of compared words from
those entries having the smallest first evaluation
value.

2. A process according to claim 1, wherein:

before the comparison of the first speech value, the
first group contains:

a plurality of first entries each entry including: a
reference to a syntactic initial category, an empty
sequence as a first specification, another of the
combinations assigned to the initial category as a
second specification, and an initial value for the
two evaluation values; and

further entries including all the categories which can
be derived from each of the combinations of the
second specification with a corresponding
combination; and

wherein each new entry in the first specification
contains a sequence extended to include the compared
word, a sequence abbreviated to exclude the compared
word in the second specification, the evaluation
value incremented by the sum of the comparison
results of the word as a first evaluation value,
the sequence extended to include the compared
word as a sequence of compared words, and the
values of the entry present;

wherein each of the first further entries includes a
reference to the syntactic category of the new
entry, from which this first further entry is derived,
an empty sequence in the first specification, another
combination assigned to the syntactic category
as a sequence in the second specification, the
new sequence number as a further sequence number,
the evaluation value of the new entry for both
evaluation values, and an empty sequence as a
sequence of compared words;

reading out the earlier entry for each second further
entry, from that group specified in the further
sequence number of the relevant new or further first
entry, where the sequence associated with the second
specification begins with the new syntactic
category to which this new or first further entry of
the new group contains a reference, wherein the
second entry contains: the reference to the syntactic
category of the earlier entry, a sequence extended
to include the syntactic category of the
current new or first further entry in the first
specification, a sequence abbreviated to exclude the same
syntactic category in the second specification, the
sequence number of the earlier entry as a further
sequence number, the sum of the first evaluation
value of the earlier entry and the difference between
the two evaluation values of the current
entry as a first evaluation value, the corresponding
evaluation value of the earlier entry and the
sequence of compared words of the earlier entry
extended to include the sequence of compared
words of the entry of the current group as a second
evaluation value; and

entering into the second list each reference to the
reference data, the associated first evaluation
value, and the sequence number of the relevant
entry.

3. A process according to claim 2 comprising making
each first and second further entry only if no entry is
present in the new group, which entry contains the
same reference, the same first and second specification
and the same further sequence number and in which the
first evaluation value is smaller than the first evaluation
value of the intended further entry; and if such an entry
is already present but with a greater evaluation value,
deleting such entry.

4. A process according to claim 2 comprising enter- 
ing a reference to reference data into the second list 
only if no reference to the reference data of the same 
word and the same sequence number with a smaller 
evaluation value is already present in said list; and if 
such entry is already present but with a greater evalua- 
tion value, deleting such entry. 

5. A process according to claim 2, comprising making 
each new and each first and second further entry only if 
its first evaluation value is smaller than a threshold 
value which is equal to the smallest first evaluation
value, extended by a constant, of all entries currently 
contained in the second list. 

6. A process according to claim 5, comprising delet- 
ing every entry in the second list whose first evaluation 
value is greater than the threshold value. 

7. A process according to claim 6, comprising deter- 
mining from the evaluation value of the preceding 
speech value, the threshold value as a smallest first 
evaluation value of the entries of the second list. 

8. A process according to claim 2, comprising delet- 
ing a word from the second list if the number of se- 
quence numbers lying between the new sequence num- 
ber and the sequence number stored at the word is 
greater than a limit value contained in the reference 
values for this word. 

9. A process according to claim 1, comprising assign- 
ing predetermined syntactic categories to other syntac- 
tic categories and/or word classes in said first list; and 
entering a reference to a word class in said second list 
for all entries of the new group where the second se- 
quence begins with a word class. 

10. A process according to claim 9, comprising calling
up an auxiliary list with each reference in the second
list, an auxiliary list is called up containing for each
word class the reference to the reference data of the
words belonging to this class, and these references call
up the corresponding reference values from the further
list.

11. Apparatus for carrying out the process according
to claim 1, comprising input means for receiving a
spoken sentence in the form of an electrical speech signal;

conversion means connected to said input means for
forming speech values; a first memory containing
specifications on syntactic categories of natural
language and their assignment to further syntactic
categories and/or specifications for words or word
classes;

a further memory for reference values formed
analogously to the speech values from sentences spoken
earlier;

comparison means connected to an output of the
conversion means and to a data output of the further
memory for supplying comparison results
from the comparison of speech values with reference
values;

a second memory for storing the entries for the second
list specifying at least a part of the address of
the further memory; and

a third memory for storing the entries for the third
list;

controller means for addressing the first, the second
and the third memory and for recording data in the
second and the third memory and reading out of
the first, second and third memory and on receiving
a word end signal for at least one word, forming
the new first and second further entries for the
third memory and subsequently the entries for the
second memory and recording in these; and, output
means for outputting after processing the last
speech signal the complete word string contained
in the third memory with the smallest evaluation
thereof.

12. Apparatus according to claim 11, wherein the
controller comprises a programmed microprocessor.

13. Apparatus according to claim 11, comprising an
auxiliary memory having an address input coupled to an
output of the second memory and an output coupled to
a partial address input of the further memory.

* * * * *
[57] ABSTRACT

A text classification system and method that can be used 
by an application for classifying natural language text 
input into a computer system having a domain specific 
knowledge base that includes a knowledge base having 
a plurality of categories. The text classification system 
classifies natural language input text by first parsing
the natural language input text into a first list of
recognized keywords. This list is then used to deduce 
further facts from the natural language input text which
are then compiled into a second list. Next, a numeric
similarity score for each one of the plurality of catego- 
ries in the knowledge base is calculated which indicates 
how similar one of the plurality of categories is to the 
natural language input text. A dynamic threshold is then
applied to determine which ones of the plurality of 
categories are most similar to the recognized keywords 
of the natural language input text. A third list is com- 
piled of the ones of the plurality of categories deter- 
mined to be most similar to the recognized keywords. 
An optional rule base can be utilized to further refine 
the determination of which ones of the plurality of 
categories are most similar to the recognized keywords
of the natural language input text. Also, an optional 
learning capability can be added to improve the accu- 
racy of the text classification system. 

24 Claims, 6 Drawing Sheets 
METHOD AND APPARATUS FOR TEXT CLASSIFICATION

FIELD OF THE INVENTION

The present invention is directed to text classification
and, more particularly, to a computer based system for
text classification that provides a resource that can be
utilized by external applications for text classification.

BACKGROUND OF THE INVENTION

The growing volume of publicly available, machine-readable
textual information makes it increasingly necessary for
businesses to automate the handling of such information to
stay competitive. By automating the handling of text,
businesses can decrease costs and increase quality in
performing tasks that require access to textual information.

A commercially important class of text processing
applications is text classification systems. Automated
text classification systems identify the subject matter of
a piece of text as belonging to one or more categories
from a potentially large predefined set of categories.
Text classification includes a class of applications that
can solve a variety of problems in the indexing and
routing of text.

Routing of text is useful in large organizations where
there is a large volume of individual pieces of text that
needs to be sent to specific persons (e.g., technical
support specialists inside a large customer support center).
Indexing text is useful in attaching topic labels to
information and partitioning the information space to aid
information retrieval. Indexing can facilitate the
retrieval of information based upon the contents of text
rather than boolean keyword searches from databases
that include information such as news articles, federal
regulations, etc.

A number of different approaches have been developed
for automatic text processing. One approach is
based upon information retrieval techniques utilizing
boolean keyword searches. This approach, however,
has problems with accuracy. A second approach borrows
natural language processing from artificial intelligence
technology to achieve higher accuracy. While
natural language processing improves accuracy based
upon an analysis of the meaning of input text, speed of
execution and range of coverage become problematic
when such techniques are applied to large volumes of
text.

Others have recognized the foregoing shortcomings
and have attempted to reach a middle ground between
information retrieval techniques and natural language/
knowledge-based techniques to achieve acceptable
accuracy without sacrificing speed of execution or range
of coverage. This has been accomplished through
predominantly rule based systems which parse the input
text using natural language morphology techniques,
attempt to recognize concepts in the text, and then use
a rule base to map from identified concepts to categories.

Text classification systems which rely upon rule-base
techniques also suffer from a number of drawbacks. The
most significant drawback is that such systems require
a significant amount of knowledge engineering to
develop a working system appropriate for a desired text
classification application. It becomes more difficult to
develop an application using rule-based systems because
all the requisite knowledge is placed into a rule base. By
doing this, a knowledge engineer must spend a significant
amount of time tuning and experimenting with the
rules to arrive at the correct set of rules to ensure that
the rules work together properly for the desired
application.

Another shortcoming in the foregoing systems is that
there is no built in mechanism to allow the knowledge
base portion of a text classification system to learn from
the input text over time to thereby increase system
accuracy. The addition of a learning component to
enhance the accuracy of a text classification system
would be desirable to improve the performance of the
system over time.

SUMMARY OF THE INVENTION

The present invention provides a method and system
for performing text classification. Specifically, the
system provides a core structure that performs text
classification for external applications. It provides a core
run time engine for executing text classification applications
around which the knowledge needed to perform text
classification can be built.

Generally, the operating environment of the present
invention includes a general purpose computer system
which comprises a central processing unit having memory,
and associated peripheral equipment such as disk
drives, tape drives and video display terminals. The
system of the present invention resides in either memory
or one of the storage devices. It is invoked by
an application running on the central processing unit to
classify text. A knowledge base is maintained on
either the disk or other storage medium in the
computer system.

The method of classifying text according to the present
invention begins upon receipt of natural language
input text which can be supplied by an external
application. The input text is then parsed into a first list
of recognized keywords which may include, e.g.,
words, phrases and regular expressions. The first list is
used to deduce further facts from the natural language
input text which might be useful in classifying the input
text. The deduced facts are then compiled into a second list.
Then, utilizing the first list, the present invention
calculates a numeric similarity score for each one of a
plurality of categories in the knowledge base which indicates
how similar one of the plurality of categories is to the
recognized keywords in the first list. A dynamic threshold
is then applied to determine which ones of the categories
are most similar to the recognized keywords of
the natural language input text. The result is a third list
which includes the categories to which the recognized
keywords are most similar. At this point, the text
classification operation of the present invention is complete
and the first, second and third lists can be passed on to the
external application for application specific processing.

The architecture of the text classification system of
the present invention comprises a natural language
module, an intelligent inferencer module and a similarity
measuring module. The natural language module
extracts as much information as possible directly from
natural language input text received by the text
classification system from an external application. The
intelligent inferencer module deduces any and all relevant
information that is implicitly contained in the natural
language input text. The similarity measuring module
calculates a numeric similarity score for each one of the
plurality of categories and applies a dynamic threshold
to the plurality of categories to ascertain which categories
are potentially most similar to the natural language
input text.

A domain specific knowledge base comprising a lexicon
of keywords, a class hierarchy organization for
keywords and a class hierarchy organization for categories
is utilized by the text classification system of the
present invention. The knowledge base is provided by
the external application that is utilizing the text
classification system of the present invention. By allowing
keyword and category classes, the present invention
simplifies the maintenance of the lexicon and an optional
rule base. Accuracy is also improved by allowing
multiple facts to be inferred from single keyword
classes.

An optional category disambiguation module can be
added to the system of the present invention to further
refine the results obtained by the similarity measuring
module. Under such circumstances, the domain specific
knowledge base can be adapted to include the optional
rule base. By making the category disambiguation module
and the rule base optional, the present invention
provides a text classification application developer
more flexibility by allowing the developer to decide
whether or not to include the rule base. While eliminating
the category disambiguation module and the rule
base may result in some loss of accuracy, the trade-off
would be that development of an application is greatly
simplified.

If, however, an application developer decides to utilize
the category disambiguation module and the rule-base,
the task is simple and straightforward because
most of the processing and comparison of the input text
is performed upstream in the architecture thereby
greatly reducing the importance of the rule-base in the
text classification process.

The system of the present invention can also be
adapted to include an optional relevance feedback
learning module as an add-on to the system of the present
invention to learn over time to increase system accuracy.
It can operate independently of the text classification
system, e.g., in a batch mode. The relevance feedback
learning module utilizes information passed to it
by the system to adjust values stored in a category
profile knowledge base. Such information may include
a category determined most relevant to a given natural
language input text, a category determined most relevant
to the same natural language input text by an external
source, e.g., a human expert, (the categories may or
may not be the same), and a list of keywords that provide
evidence for the categories selected along with the
amount of evidence they provide.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an exemplary computer system for
implementing a text classification system according to
the present invention.

FIG. 2 illustrates an exemplary architecture of the
modules utilized in a text classification system according
to the present invention.

FIG. 3 shows the exemplary architecture illustrated
in FIG. 2 together with an exemplary domain specific
knowledge base.

FIG. 4 illustrates an exemplary portion of the keyword
class hierarchy.

FIG. 5 illustrates an exemplary embodiment of the
modules that comprise the intelligent inferencer module
illustrated in FIG. 2.

FIG. 6 illustrates an exemplary implementation of the
category disambiguation module illustrated in FIG. 2.

DETAILED DESCRIPTION

Referring now to the drawings, and initially to FIG.
1, there is illustrated an exemplary embodiment of a
system for implementing the present invention. The
system 10 comprises a computer 12 having a memory 22
associated therewith and with associated peripheral
equipment such as a disk drive and storage unit 14, a
tape drive 16 and a video display terminal 18. The
computer 12 is generally any high performance computer
such as a Digital Equipment Corporation VAX
6000-100. In conjunction with the computer 12, a domain
specific knowledge base 20 that includes application-specific
information is stored on the disk drive 14
and an application program 24 is stored in the memory
22.

Referring now to FIG. 2, there is illustrated an exemplary
architecture for a text classification system 30 of
the present invention. The system 30 comprises a natural
language module 32, an intelligent inferencer module 34
and a similarity measuring module 36. An optional
category disambiguation module 38 and an optional
relevance feedback learning module 40 are also
shown in FIG. 2. Modules 32, 34 and 36 (and 38, if
selected to be part of the system) comprise what is
hereinafter referred to as "the run time system" of the
present invention. These modules are referred to as the
run time system because collectively, they are invoked
by the computer 12 (FIG. 1) to process and classify
natural language text received from an external source,
e.g., the application 24.

FIG. 3 illustrates the system 30 of FIG. 2 with the
domain specific knowledge base 20 of FIG. 1. As illustrated
in FIG. 1, the knowledge base 20 is shown as
being stored on the disk drive 14. It should be understood
that it could also be stored in the memory 22
(FIG. 1) or any other appropriate storage device coupled
to the computer 12. The knowledge base 20 is
external to the system 30. The information stored in the
knowledge base 20 is provided by an applications
programmer who is charged with developing the application
24 that is utilizing the system 30 to perform text
classification functions. The modules which comprise
the domain specific knowledge base 20 are a lexicon 52,
a keyword class hierarchy 54, keyword/category profiles
56 and an optional category selection rule base 58
(utilized when the optional category disambiguation
module 38 is used).

Each of the modules of the system 30 and the components
of the knowledge base 20 are briefly discussed
below. The function of the natural language module 32
is to extract as much information as possible directly
from natural language input text. The input text can be
any machine-readable natural language text as determined
by the external application 24. The natural language
module 32 uses the lexicon 52, which comprises
keywords which can include, for example, words,
phrases, and regular expressions, to identify all recognized
keywords in the natural language input text.
Specifically, the module 32 extracts all the relevant
information that is explicitly contained in the input text. The
natural language module 32 passes a list of all the recognized
keywords to the intelligent inferencer module 34.

An example of a natural language module of the type
described above is disclosed in U.S. patent application
Ser. No. 07/729,445, entitled "Method and Apparatus
for Efficient Morphological Text Analysis using a High 
Level Language for Compact Specification of Inflec- 
tional Paradigms," (hereinafter "the Morphological 
Text Analysis patent application") filed Jul. 12, 1991
and assigned to Digital Equipment Corporation. This 
application is expressly incorporated herein by refer- 
ence. 

The list of recognized keywords passed to the intelligent
inferencer module 34 is used to deduce any and all
relevant information that is implicitly contained in the 
input text. To accomplish this task, the intelligent in- 
ferencer module 34 uses the keyword class hierarchy 54 
to deduce further facts from the information explicitly 
stated in the input text. Keywords are grouped into
classes in the keyword class hierarchy 54. Each class 
has associated facts that are true when a member of the 
class is identified in the input text. 

For example, the input text may mention problems
with a specific type of disk device but not explicitly
mention that the problems are with a disk. The keyword
class hierarchy 54 can include a class called "DISK
DEVICES" with specific disks as members. The fact
"(DEVICE TYPE = DISK)" can be attached to this
class. When a specific disk device is identified, the fact
"(DEVICE TYPE = DISK)" can be inferred even
though the word "disk" was not explicitly mentioned in
the input text. The intelligent inferencer module 34 also
performs word substitutions in key phrases. The intelligent
inferencer module 34 passes the list of recognized
keywords and a list of all the extra facts that could be
deduced from the recognized keywords to the similarity
measuring module 36.

The list of recognized keywords extracted from the
input text passed to the similarity measuring module 36
is used to calculate a numeric similarity score for each
predefined category. Each score indicates how similar a
given category is to the input text. The similarity
measuring module 36 uses a knowledge base of
keyword/category profiles 56 to determine the similarity score.
Each category in the knowledge base of keyword/category
profiles 56 has an associated profile. The profile
tells the similarity measuring module 36 which keywords
provide evidence for the given category. Associated
with each keyword in a profile is a numeric weight
called a "profile weight" that tells the similarity measuring
module 36 the amount of evidence a keyword provides
for the given category. The module 36 determines
profile weights and combines the profile weights to
arrive at similarity scores for all the categories. Once
the similarity scores have been calculated, a dynamic
threshold is applied to all of the categories defined in
the domain specific knowledge base 20. Those categories
whose similarity scores are below the threshold are
discarded from consideration as being potentially most
similar to the input text. The categories whose similarity
scores are above the threshold are compiled into a list
and are passed to the next module or directly to the
external application 24 (not shown), along with the list
of extracted keywords and the list of deduced facts, if
there are any.
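
By way of illustration only, the scoring and thresholding
performed by the similarity measuring module 36 can be
sketched as follows. The description does not state how the
profile weights are combined or how the dynamic threshold is
derived, so this sketch assumes a frequency-weighted sum and a
threshold set to a fixed fraction of the best score. The
category names, weights, and function names are hypothetical.

```python
from collections import Counter

# Hypothetical keyword/category profiles: category -> {keyword: profile weight}.
PROFILES = {
    "DISK_PROBLEM": {"disk": 0.6, "analyze disk": 0.9, "crash": 0.3},
    "NETWORK_PROBLEM": {"network": 0.8, "crash": 0.2},
    "PRINTER_PROBLEM": {"printer": 0.9},
}


def similarity_scores(keyword_counts, profiles):
    """Combine profile weights into one score per category.

    Assumption: weights are summed, each scaled by keyword frequency.
    """
    scores = {}
    for category, profile in profiles.items():
        scores[category] = sum(
            weight * keyword_counts.get(keyword, 0)
            for keyword, weight in profile.items()
        )
    return scores


def apply_dynamic_threshold(scores, fraction=0.5):
    """Keep categories scoring at or above a fraction of the best score.

    Assumption: the threshold moves with the best score for this input.
    """
    best = max(scores.values(), default=0.0)
    if best <= 0:
        return []
    threshold = fraction * best
    kept = [(c, s) for c, s in scores.items() if s >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Recognized keywords with their frequencies in the input text.
keyword_counts = Counter({"disk": 2, "analyze disk": 1, "crash": 1})
scores = similarity_scores(keyword_counts, PROFILES)
most_similar = apply_dynamic_threshold(scores)
```

With these hypothetical weights only the disk category survives
the threshold; a different combination rule or threshold policy
would slot into the same two functions.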

The list of most similar categories, the list of ex- 
tracted keywords, and the list of deduced facts, if any, 
can then either be passed directly out to the external 
application 24, to the optional category disambiguation
module 38 or to the optional relevance feedback learn- 
ing module 40. If a rule base is desired for a particular 
application, the information is passed to the category 
disambiguation module 38. The module 38 uses the
category selection rule base 58 to select certain catego- 
ries over other categories based on the list of recognized 
keywords and the list of deduced facts. This module 38 
further refines the list of the most similar categories and 
passes it, along with the list of recognized keywords and 
the list of deduced facts to the external application 24 
and, if desirable, the optional relevance feedback learn- 
ing module 40. 

The relevance feedback learning module 40 is an 
add-on to the run time system of the present invention. 
It can operate independently of the run time system, 
e.g., in a batch mode. The input to the relevance feedback
learning module 40 comprises the category determined
most relevant to a given input text, the category
determined most relevant to the same input text by an 
external source, e.g., a human expert, (the categories 
may or may not be the same), and the list of recognized 
keywords that provide evidence for the categories se- 
lected along with the amount of evidence they provide. 
The module 40 then takes this information and adjusts 
the profile weights in the keyword/category profiles 56 
accordingly. 

The task that the natural language module 32 of the 
run time system of the present invention performs is to 
extract all the relevant information that is explicitly 
contained in a natural language input text. To accom- 
plish this task, the module 32 uses the lexicon 52. The 
lexicon 52 contains all the information that is considered 
relevant for extraction purposes. 

A brief description of the processing performed by 
the natural language module 32 is set forth below. For 
a complete description, the reader is referred to the 
Morphological Text Analysis patent application, re- 
ferred to above, which is expressly incorporated herein 
by reference. 

The natural language module 32 allows the inclusion 
of single word nouns and multiple word noun phrases 
into the lexicon 52. The natural language module 32 will 
recognize the root form of a noun or noun phrase as 
well as morphological variants of the root, e.g., the plural
form of the root noun or noun phrase. It also allows 
synonyms of a keyword to be entered into the lexicon 
52 which are useful when defining keyword classes or 
when writing disambiguation rules. 

Single word verbs can also be included in the lexicon 
52. The root form of a verb must be entered into the 
lexicon 52. This way, the module 32 will not only rec- 
ognize the root form, but morphological variants as 
well. For example, the verb "crash" in the lexicon 52
will identify "crashes", "crashing", and "crashed". 

A limited form of multiple word verb phrases is
allowed into the lexicon 52. In this case, a verb phrase 
is considered to be a single word verb combined with a 
single word noun or noun phrase subject/object (e.g., 
"Analyze Disk").

When keyword matching is performed for a verb 
phrase, each sentence in the input text is reviewed sepa- 
rately. For each sentence, the natural language module 
32 tries to find the verb contained in the verb phrase. If 
the verb is found, it then looks to see if the noun or noun 
phrase contained in the verb phrase is present in the 
sentence. If both the verb and the noun phrase are found 
in the same sentence, then the entire verb phrase has 
been identified. For example, suppose the lexicon 52 contains
the verb phrase "Analyze Disk" and one of the sentences
in the input text that the present invention is parsing is
the following: "I need help analyzing this damaged
disk." The natural language module 32 will first identify
the single keywords "analyze" and "disk" (analyzing is
a morphological variant of analyze). Then it will notice
that "analyze" is the verb part of a verb phrase. It will
then search the list of recognized keywords for that
sentence for the noun part of the phrase (in this case the
word "disk"). Since "disk" is in the keyword list, the
present invention then identifies the verb phrase
"Analyze Disk." The process works exactly the same way for
multiple word noun phrases inside the verb phrase (e.g.,
"Analyze Process Dump," instead of "Analyze Disk").

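The per-sentence matching just described can be sketched as
follows, for illustration only. The sketch assumes that
morphological variants have already been reduced to root-form
keywords upstream, and it represents each verb phrase as a
hypothetical (verb, noun) pair; neither the data layout nor
the function name comes from the description.

```python
# Hypothetical lexicon entries: verb phrase -> (verb root, noun or noun phrase).
VERB_PHRASES = {
    "Analyze Disk": ("analyze", "disk"),
    "Analyze Process Dump": ("analyze", "process dump"),
}


def match_verb_phrases(sentence_keywords, verb_phrases=VERB_PHRASES):
    """Return verb phrases whose verb and noun part both occur in one sentence.

    `sentence_keywords` holds the root forms already recognized in a single
    sentence, so a variant such as "analyzing" is assumed to have been
    reduced to "analyze" before this function is called.
    """
    found = set(sentence_keywords)
    return [
        phrase
        for phrase, (verb, noun) in verb_phrases.items()
        if verb in found and noun in found
    ]


# Keywords recognized in "I need help analyzing this damaged disk."
matches = match_verb_phrases(["analyze", "disk"])
```

Because the check is made against the keyword list of one
sentence at a time, a verb in one sentence and a noun in
another do not combine into a phrase.
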
The lexicon 52 can also include single word regular 
expressions. If a regular expression is in the lexicon 52, 
then the natural language module 32 will identify any 
word in the input text that matches against the regular
expression. Being able to define regular expressions in 
the lexicon 52 gives the maintainer of the lexicon 52 
more flexibility than being restricted to defining literal 
words and phrases. For example, the term "SYS$:a+"
can be defined to match all the VMS (an operating
system available from Digital Equipment Corporation) 
operating system service routines instead of having to 
enter the name of every operating system service rou- 
tine directly into the lexicon 52. 

Some of the syntax rules of the regular expressions
allowed in the lexicon 52 are that an ordinary character
matches that character; a period matches any character;
a colon matches a class of characters described by the
following character, e.g., ":a" matches any alphabetic,
":d" matches digits, ":n" matches alphanumerics; an
expression followed by an asterisk matches zero or
more occurrences of that expression, e.g., "fo*" matches
"f", "fo", "foo", etc.; an expression followed by a
plus sign matches one or more occurrences of that
expression, e.g., "fo+" matches "fo", "foo", etc.
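
As an illustration of these syntax rules, the sketch below
translates a lexicon pattern into a pattern for Python's
standard `re` module and tests whole words against it. Only
the rules listed above are handled; the function names and
the choice of ASCII character ranges for ":a", ":d" and ":n"
are assumptions made for this example.

```python
import re

# Character classes named in the text; other ":x" classes are not handled.
CLASS_MAP = {"a": "[A-Za-z]", "d": "[0-9]", "n": "[A-Za-z0-9]"}


def lexicon_pattern_to_re(pattern):
    """Translate the lexicon's pattern syntax into a compiled Python regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == ":":
            # A colon introduces a character class named by the next character.
            if i + 1 >= len(pattern) or pattern[i + 1] not in CLASS_MAP:
                raise ValueError("unsupported class at position %d" % i)
            parts.append(CLASS_MAP[pattern[i + 1]])
            i += 2
            continue
        if ch in "*+.":
            parts.append(ch)  # same meaning in Python's re module
        else:
            parts.append(re.escape(ch))  # ordinary character matches itself
        i += 1
    return re.compile("".join(parts))


def matches_word(pattern, word):
    """Return True if the whole word matches the lexicon pattern."""
    return lexicon_pattern_to_re(pattern).fullmatch(word) is not None
```

For instance, `matches_word("fo*", "f")` and
`matches_word("fo+", "foo")` hold, while
`matches_word("fo+", "f")` does not.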

The output of the natural language module 32 is a list 
which is a collection of sublists where each sublist cor- 
responds to a single sentence in the input text and con- 
tains all the recognized keywords in that sentence. This 
list is passed to the intelligent inferencer module 34 for
further analysis and possible augmentation as is de- 
scribed below. 

The intelligent inferencer module 34 takes the infor- 
mation extracted directly from the input text by the 
natural language module 32 and attempts to add to that
information by deducing further facts that are implied 
by the keywords identified. This module 34 uses the 
keyword class hierarchy 54. Each class in the keyword 
class hierarchy 54 contains a group of keywords (already
defined in the lexicon 52) that share something in
common. The classes are structured into a hierarchy 
such that classes themselves can be members of other 
classes. An exemplary portion of the keyword class 
hierarchy 54 is illustrated in FIG. 4. 

What is useful about these classes is that facts can be
attached to them to deduce implied information if a 
member of a class is found in the input text. If a key- 
word class member is identified, then all the facts at- 
tached to that class are inferred and added to the list of 
deduced facts. In addition, all the facts attached to the
parent classes are inferred and added to the list of de- 
duced facts as well. 
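
The inference rule above (facts of the matching class plus
facts of its parent classes) can be sketched as follows. The
hierarchy shown is a hypothetical fragment: the "DISK DEVICES"
class and its fact come from the earlier example, while the
"DEVICES" parent class, its fact, and the member "ra90" are
invented purely for illustration.

```python
# Hypothetical slice of a keyword class hierarchy.
CLASSES = {
    "DEVICES": {"parent": None, "members": set(), "facts": ["(HARDWARE = YES)"]},
    "DISK DEVICES": {
        "parent": "DEVICES",
        "members": {"rd54", "ra90"},
        "facts": ["(DEVICE TYPE = DISK)"],
    },
}


def infer_facts(keywords, classes=CLASSES):
    """Collect facts of every class a keyword belongs to, plus its ancestors."""
    facts = []
    for keyword in keywords:
        for name, info in classes.items():
            if keyword not in info["members"]:
                continue
            current = name
            while current is not None:  # walk up through the parent classes
                for fact in classes[current]["facts"]:
                    if fact not in facts:
                        facts.append(fact)
                current = classes[current]["parent"]
    return facts


# "crash" belongs to no class here, so only "rd54" contributes facts.
deduced = infer_facts(["rd54", "crash"])
```

Each fact is added once even when several members of the same
class are found in the input text.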

In addition to inferring new facts with keyword
classes, more general descriptions of an identified keyword
can be substituted in an attempt to match other
key phrases. This process is called "keyword substitution."
It is an attempt to match key phrases in the lexicon
52 that could not be matched explicitly. For example,
it may be desirable to match the phrase "Analyze
Disk" every time "Analyze X" is detected, where X is 
a specific disk device. This is accomplished without 
having to enter a single verb phrase for every specific 
disk device into the lexicon 52 which would cause the 
maintenance of a lexicon to become problematic. 

Using keyword substitution, a group of like devices 
can be grouped into a class and a word attached to the 
class to be used as a substitute for matching phrases in 
the lexicon 52. Going back to the example above, a class 
of disk devices can be defined and the keyword "disk" 
can be associated as a substitute. This way, "Analyze 
RD54" (where RD54 is a model number of a disk drive) 
can be recognized as "Analyze Disk" without having to 
have "Analyze RD54" stored in the lexicon 52. 

The output of the intelligent inferencer module 34 is
the list of all the extracted keywords and the list of all
the deduced facts that the intelligent inferencer module
34 was able to infer. Associated with each extracted
keyword is a number designating the frequency of the
keyword in the input text.

An exemplary embodiment of the modules which
comprise the intelligent inferencer module 34 is shown
in FIG. 5. The left hand side of FIG. 5 shows the two
main modules of the intelligent inferencer module 34, a
fact inferencer module 60 and a keyword substitution
module 62. The right hand side of FIG. 5 shows that
both modules 60 and 62 use the keyword class hierarchy
54 (the same one illustrated in FIG. 3) as their knowledge
base. The fact inferencer module 60 only utilizes
the facts associated with the classes in the keyword
class hierarchy 54 and the keyword substitution module
62 only uses the keyword substitutes associated with the
classes in the keyword class hierarchy 54.

The fact inferencer module 60 follows a general
method for attaching facts to keywords. This method,
which is repeated for each keyword K, first searches
the keyword class hierarchy 54 for all classes C of
which the identified keyword is a member. Then, for
each identified class C of which K is a member, all
facts associated with C are added to a global list of
deduced facts. The step of adding all facts associated with the
identified class C is then applied recursively on all of the
parent classes of C. By following this method, the fact
inferencer module 60 adds facts to the list of deduced
facts.
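
The fact inference method can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the dictionary layout, the function name `infer_facts`, the class names in `HIERARCHY`, and the `$keyword` placeholder (standing for the matched keyword) are all assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple

Fact = Tuple[str, str]

# Hypothetical keyword class hierarchy: each class lists its member
# keywords, the facts attached to it, and its parent classes.
HIERARCHY: Dict[str, dict] = {
    "TAPE-DEVICES": {
        "members": {"TK-DEVICE"},
        "facts": [("DEVICE-TYPE", "TAPE")],
        "parents": [],
    },
    "RDB-ERROR-MSGS": {
        "members": {"AIJDISABLED"},
        "facts": [("LAYERED-PROD", "RDB")],
        "parents": ["ERROR-MSGS"],
    },
    "ERROR-MSGS": {
        "members": set(),
        "facts": [("ERROR-MESSAGE", "$keyword")],
        "parents": [],
    },
}


def infer_facts(keywords: List[str], hierarchy: Dict[str, dict]) -> List[Fact]:
    """Collect the facts of every class a keyword belongs to, plus its ancestors."""
    deduced: List[Fact] = []

    def add_facts(class_name: str, keyword: str, seen: Set[str]) -> None:
        if class_name in seen:  # guard against revisiting a shared ancestor
            return
        seen.add(class_name)
        cls = hierarchy[class_name]
        for slot, value in cls["facts"]:
            fact = (slot, keyword if value == "$keyword" else value)
            if fact not in deduced:
                deduced.append(fact)
        for parent in cls["parents"]:  # recursive step on the parent classes
            add_facts(parent, keyword, seen)

    for keyword in keywords:
        for class_name, cls in hierarchy.items():
            if keyword in cls["members"]:
                add_facts(class_name, keyword, set())
    return deduced
```

Keywords that belong to no class, such as a plain verb, simply contribute no facts.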

The keyword substitution module 62 similarly fol- 
lows a general method for substituting keywords. This 
method, which is repeated for each keyword K, first 
searches the keyword class hierarchy 54 for all classes 
C, of which K is a member. Then, all the substitution 
keywords S, associated with C are retrieved for each 
identified class C where K is a member. Then, S is 
substituted for K and an attempt is made to match verb 
phrases in the lexicon 52. If a match is found, it is added 
to a global list of identified keywords. Then, the steps of 
retrieving substitution keywords and substituting key- 
words are recursively applied on all of the parent 
classes of C. 
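
The keyword substitution method can be sketched similarly. The sketch below is hedged: `substitute_keywords`, `contains`, the `HIERARCHY` layout and the `DEVICES` parent class are hypothetical, and phrases are matched as contiguous token runs, which is a simplification chosen for this example.

```python
from typing import Dict, List, Set, Tuple

Phrase = Tuple[str, ...]

# Hypothetical class hierarchy with one substitute keyword per class.
HIERARCHY: Dict[str, dict] = {
    "DISK-DEVICES": {"members": {"RD54"}, "substitutes": ["DISK"], "parents": ["DEVICES"]},
    "DEVICES": {"members": set(), "substitutes": ["DEVICE"], "parents": []},
}
LEXICON_PHRASES: List[Phrase] = [("ANALYZE", "DISK")]


def contains(tokens: List[str], phrase: Phrase) -> bool:
    """True when `phrase` occurs as a contiguous run inside `tokens`."""
    n = len(phrase)
    return any(tuple(tokens[i:i + n]) == phrase for i in range(len(tokens) - n + 1))


def substitute_keywords(
    tokens: List[str],
    keywords: List[str],
    hierarchy: Dict[str, dict],
    phrases: List[Phrase],
) -> List[Phrase]:
    """Return lexicon phrases that match once a keyword is replaced by a substitute."""
    matched: List[Phrase] = []

    def try_class(class_name: str, keyword: str, seen: Set[str]) -> None:
        if class_name in seen:
            return
        seen.add(class_name)
        cls = hierarchy[class_name]
        for sub in cls["substitutes"]:
            # Substitute S for K, then attempt to match lexicon phrases.
            candidate = [sub if tok == keyword else tok for tok in tokens]
            for phrase in phrases:
                if sub in phrase and contains(candidate, phrase) and phrase not in matched:
                    matched.append(phrase)
        for parent in cls["parents"]:  # recursive step on the parent classes
            try_class(parent, keyword, seen)

    for keyword in keywords:
        for class_name, cls in hierarchy.items():
            if keyword in cls["members"]:
                try_class(class_name, keyword, set())
    return matched
```

With this data, the tokens `["ANALYZE", "RD54"]` yield the phrase `("ANALYZE", "DISK")` without "Analyze RD54" ever being stored in the lexicon.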

The similarity measuring module 36 is responsible for
returning a numeric similarity score for each category
in the keyword/category profile 56. Each score indicates
how similar a given category is to the recognized
keywords extracted from the natural language input
text. The similarity measuring module 36 uses the
knowledge base of keyword/category profiles 56 to
determine similarity scores for all of the categories
defined. Each category in the keyword/category profiles 56
has its own profile containing the keywords that
are relevant to that category. Once the input text is
parsed by the natural language module 32 and the
intelligent inferencer module 34, a list of all the keywords
present in the input text, as well as the number of times
they occur in the input text, called term frequency, is
assembled by the similarity measuring module 36. The
category profile can be represented as an n-dimensional
vector of the form C=(c1, c2, . . . , cn), where n equals
the total number of possible keywords in the lexicon 52
and the individual element "ci" represents the
corresponding profile weight of keyword "i" in the category
profile. The input text can also be represented as an
n-dimensional vector of the form T=(t1, t2, . . . , tn),
where n is as above and "ti" represents the corresponding
weight of keyword "i" in the input text. Similarity
between a category and an input text can then be measured
as the inner product between these corresponding
vectors, which is defined as:

Sim(C,T) = SUM(i = 1, n) (ci * ti).

The size of n can vary depending on the size of the
keyword lexicon 52.

The similarity measuring module 36 includes a
method for efficiently computing the inner product
similarity measure so that when n becomes large the
similarity measures can still be quickly calculated. The
method assumes that each keyword in the lexicon 52 has
a corresponding vector of categories that it provides
evidence for and a profile weight for each category.
This information can be quickly computed from the
category profile vectors described above. This is
accomplished by first initializing all similarity scores for
all categories to zero. Then, for each keyword i identified
in the input text and for each category j in the
category vector of the keyword i, the keyword weight
of keyword i is multiplied by the profile weight of
category j. Then, the resulting product is added to the
similarity score for the category j.

The foregoing method insures that only the identified
keywords and the categories they provide evidence for
are being multiplied together. All the other portions of
the inner products will equal zero anyway since the
keyword weights will be zero (i.e., the keywords were
not identified in the input text). The run time performance
of this method is significantly better than performing
a straight summation of the products of the
vector elements because of the large number of elements
equaling zero in the vectors.

Like keywords, categories can also be grouped into
hierarchically structured classes. This feature allows a
lexicon maintainer to define category class profiles as
well as category profiles. The run time system of the
present invention automatically translates category
class profiles into individual category profiles and
incorporates them into existing category profiles. Category
classes are also useful when writing disambiguation
rules. By having category classes, a single rule can
operate on an entire class rather than writing individual
rules for each category in a class.

The initial weights for category profiles and keyword
weights for input texts are ascertained by formulae used
by the similarity measuring module 36 that uses both
term frequency and collection frequency as input. In
text classification terms, collection frequency is the
number of category profiles in which a specific keyword
occurs. The profile weight calculation formula is
as follows:

PW = log(CAT/CF)

where CAT equals the total number of defined categories
and CF equals the collection frequency of the given
keyword (this formula uses only collection frequency).
Note that as CF increases, the profile weight decreases.
This makes sense because if a keyword provides evidence
for a large number of categories then its profile
weight should be lower than a keyword that provides
evidence for a small number of categories.

The keyword weight calculation formula is as follows:

KW = (TF * log(CAT/CF))/CKW

where CAT and CF are as above, TF equals the term
frequency of the keyword in the input text, and CKW is
the combined keyword weight and is calculated as follows:

CKW = SQUARE_ROOT(SUM(i = 1, n) (SQUARE(tfi * log(CAT/cfi))))

where n is the total number of keywords found in the
input text, "tfi" and "cfi" are the term and collection
frequencies for one of the found keywords, and CAT is
as previously defined.

Once similarity scores have been calculated for all
categories, the similarity measuring module 36 applies a
dynamic threshold to the list of categories. This threshold
is a given tuneable offset from the similarity score of
the most similar category. In other words, if N is the
highest similarity score for the input text and M is the
pre-defined threshold offset, then N-M is the threshold
value. All categories whose similarity scores are
below the threshold value are discarded and those
above threshold value are compiled into a list and
passed to the next module, along with the list of
recognized keywords and the list of deduced facts.

As described above, the foregoing results can be
passed directly to the external application 24, to the
relevance feedback learning module 40 or the category
disambiguation module 38. If the information is passed
to the optional category disambiguation module 38, it
uses a rule base to select certain categories over other
categories based on the list of recognized keywords and
the list of deduced facts. Rules are utilized to decide the
appropriate category when more than one category is a
potential candidate for being the most similar. The left
hand sides of the rules consist of CATEGORY and
KEYWORD slot-value pairs and deduced facts. The
right hand sides of the rule merely assert a preselected
preference for one category over another category (or
set of categories).

An example of a rule that could be used by the category
disambiguation module 38 is set forth below.

IF:   (CATEGORY = VMS-FILE-SYSTEM)
      (CATEGORY = ?X (DECNET-VAX or VMS-TAPE))
      (DEVICE-TYPE = DISK)
      (KEYWORD = "ACP")
THEN: PREFER VMS-FILE-SYSTEM OVER ?X

This rule states that if VMS-FILE-SYSTEM and
either DECNET-VAX or VMS-TAPE are potentially
most similar categories, and if the fact (DEVICE-TYPE=DISK)
was deduced by the intelligent inferencer
module 34, and if the keyword ACP has been found (or
one of its synonyms); then the category VMS-FILE-SYSTEM
will be preferred over either DECNET-VAX or VMS-TAPE.

When the category disambiguation module 38 is invoked,
all the rules that can apply to the given input text
are fired and all the category preferences are recorded
by the category disambiguation module 38. As a result
of the firing of the rules, the list of categories whose
similarity scores are above the threshold value is modified
to include only the most similar categories that do
not have any other category with preference over them.
This list, along with the list of recognized keywords and
the list of deduced facts, is then passed to the application 24
(and to the relevance feedback learning module 40).

As described above, the category disambiguation
module 38 is detachable from the run time architecture
of the present invention. If a particular text classification
application has no heuristics for category selection,
then the category disambiguation module 38 can be
bypassed and reliance can be placed solely on similarity
scores calculated by the similarity measuring module
36 to determine the most similar category. Detaching
the rule base will most likely result in a decrease in the
accuracy of the classification; but for some applications
no such rule base exists. By making the rule base detachable,
the range of potential applications that can be
developed using the present invention is increased.

An exemplary implementation of the category disambiguation
module 38 is illustrated in FIG. 6. The top
portion of FIG. 6 shows the compile time processing
needed to translate the category selection rule base 58
into the proper syntax for a run time inference engine,
such as "CLIPS," which is a public domain inference
engine developed by NASA. A rule compiler 68 takes
as input a category selection rule base 64 and category
class hierarchy 66. At run time, all the recognized keywords,
facts, and most similar categories (that are given
as input to the category disambiguation module) are
translated into CLIPS facts 72 and are given as input
(along with a CLIPS rule base 70) to the CLIPS inference
engine 74. The CLIPS inference engine 74 fires as
many rules as it can against the given facts. Each rule
firing returns a category preference. Once all the rules
that can fire have fired, then all the category preferences
are collected and used by the present invention to
come up with a final list of most similar categories (as
described above).

Once text classification is done and the information is
passed to the external application 24 for further application
specific processing, the optional relevance feedback
learning module 40 can be invoked to adjust the
keyword/category profile weights to achieve better
accuracy. The module 40 collects all the text classifications
over a predetermined period from either the similarity
measuring module 36 or the optional category
disambiguation module 38, whichever is the last module
of the run time system. The classifications include the
input text, the chosen most similar category, and the
keyword weights for the extracted keywords. Then, the
relevance feedback learning module 40 performs the
following tasks for each category profile in the
keyword/category profiles 56. First, all the text classifications
where the particular category was identified as the
most similar are collected. Next, the input texts which
were correctly classified and which were not are determined.
Then, the keyword weights for all the correctly
classified input texts are added to the corresponding
keyword profile weights in the category profile. Finally,
the keyword weights for all the incorrectly classified
input texts are subtracted from the corresponding
keyword profile weights in the category profile. Also,
the correct category is determined and the keyword
weights are added to the profile of that category.

An example of an application that may use the text
classification system of the present invention is routing of
customer service requests within a customer support
center. Without an automated text classifier, human call
screeners interact with a call handling system and determine
the appropriate group to send a customer service
request. A call handling system records all the pertinent
information that a support specialist needs to solve the
customer problem. With an automated text classifier,
the call handling system can automatically invoke the
text classification system of the present invention to
determine where to route the customer service request
without human intervention.

The following section provides an example of a call
handling application for a given customer service request
and shows how the individual modules of the text
classification system of the present invention operate on
the customer service request. The output of the text
classification system enables the application to route the
customer service request to the appropriate group. Although
not shown here because it is application specific
processing, the call handling system would take this
output and automatically send the customer service
request to the identified support group.

Set forth below is an explanation of the processing
performed by the system of the present invention using
an example natural language text input. The example
text input is:

"While trying to backup my database to a TK70, the
process died with the error AIJDISABLED and
produced a dump file. I need help analyzing the
dump file and getting the backup to work."

This input text is passed to the run time system by the
external application 24 in machine readable form. The
following explanation demonstrates how the present
invention processes this input text.

As discussed above, the processing begins with the
natural language module 32. The natural language module 32
utilizes the lexicon 52 to recognize words or
phrases in the natural language input text. Set forth
below is an example of entries in the lexicon 52. Each
entry in the lexicon 52 has a corresponding identifier
which defines the entry type. For example,
"BACKUP" is identified as a verb and a noun in separate
entries.

BACKUP          VERB
BACKUP          NOUN
DATABASE        NOUN
AIJDISABLED     NOUN
DUMP FILE       NOUN PHRASE
ANALYZE DUMP    VERB PHRASE
TK:D+           REGULAR EXPRESSION

Given the entries in the lexicon 52, the natural language
module 32 identifies the following keywords and
phrases from the given natural language input text:
backup (twice, once as a verb and once as a noun),
database, AIJDISABLED, TK70 (matched against the
regular expression), dump file, and analyze dump. Two
interesting events happen in the recognition of the verb
phrase, "analyze dump." The first is that a morphological
variant of the verb "analyze" is identified ("analyzing"
being the morphological variant). The second is
that the phrase was recognized as a single unit even
though the two words that comprise it were not contiguous
in the input text. As previously described, the
natural language module 32 identifies the verb portion
of the verb phrase and then looks for an occurrence of
the noun portion in the same sentence. In this case it was
successful so the entire verb phrase matches. The list
that the natural language module 32 outputs as a data
structure to the intelligent inferencer module 34 as the
result of parsing this input text would look something
like this:

((S1: ("BACKUP" 1) ("DATABASE" 1)
      ("TK-DEVICE" 1)
      ("AIJDISABLED" 1) ("DUMP FILE" 1))
 (S2: ("ANALYZE DUMP" 1) ("BACKUP" 1))).

The numbers with each identified keyword represent
the frequency of the keyword in the given sentence.
This data structure is passed as input to the intelligent
inferencer module 34.

The intelligent inferencer module 34 uses class information
to deduce further information from the input
text. For the purposes of this example, the following
classes are defined to reside in the keyword class
hierarchy 54:

Keyword Class TAPE-DEVICES, which is a grouping
  of all tape devices and includes "TK-DEVICE",
Keyword Class RDB-ERROR-MSGS, which is a
  grouping of all the error messages generated by the
  product RDB, and includes "AIJDISABLED",
Keyword Class ERROR-MSGS, which is a grouping
  of all possible error messages and includes the
  keyword class RDB-ERROR-MSGS,
Category Class VIA-PRODUCTS, which is a grouping
  of all the VIA products, including RDB (RDB
  is a category in the domain-specific knowledge
  base).

The following facts are associated with the above
classes in the keyword class hierarchy 54:

(CLASS = TAPE-DEVICE)    -> (DEVICE-TYPE = TAPE),
(CLASS = RDB-ERROR-MSGS) -> (LAYERED-PROD = RDB),
(CLASS = ERROR-MSGS)     -> (ERROR-MESSAGE = $keyword).

Given these classes and associated facts, it can be
deduced that the DEVICE-TYPE is TAPE because of
the identification of TK70. A potential layered product
is RDB because of the identification of AIJDISABLED
as a RDB error message. An error message
found in this input text is AIJDISABLED.

The list of recognized keywords and the list of deduced
facts output by the intelligent inferencer module
34 as a data structure would look something like this:

((KEYWORDS: ("BACKUP" 2) ("DATABASE" 1)
            ("TK-DEVICE" 1) ("AIJDISABLED" 1)
            ("DUMP FILE" 1) ("ANALYZE DUMP" 1))
 (FACTS: (DEVICE-TYPE: TAPE) (LAYERED-PROD: RDB)
         (ERROR-MESSAGE: AIJDISABLED))).

Notice that the keywords are now not separated into
sentence groupings and that the information deduced
by the intelligent inferencer module 34 is incorporated
into the data structure. This data structure is passed as
input to the similarity measuring module 36.

The similarity measuring module 36 calculates a similarity
measure for every category in the keyword/category
profiles 56 against the identified keywords in the
input text. The numbers associated with each keyword
are converted to term weights by using the term
weighing formulae previously described. To keep things
simple in this example, it is assumed that the term
weights remain as they are above. For this example, the
following category profiles are contained in the
keyword/category profiles 56:

Category BACKUP has the associated keyword "BACKUP"
Category RDB has the associated keywords
  "DATABASE" and "AIJDISABLED"
Category DBMS has the associated keyword "DATABASE"
Category TAPE has the associated keyword "TK-DEVICE"
Category BUGCHECK has the associated keywords
  "DUMP FILE" and "ANALYZE DUMP"

It is assumed for this example that each keyword in
each category has a weight of 1. It is also assumed that
there are other categories in the knowledge base, but
that none of them have any keywords in their profiles
that match the keywords found in the input text. Also,
it should be understood that the categories above have
other keywords in their profiles, but for simplicity, only
the keywords that match keywords found in the input
text are presented. The similarity measures for the
categories above are then as follows:

Sim(T, BACKUP) = 2 ("BACKUP" is a keyword twice)
Sim(T, RDB) = 2
Sim(T, DBMS) = 1
Sim(T, TAPE) = 1
Sim(T, BUGCHECK) = 2

For this example, a category threshold offset of 0.5 is
chosen. This means that only the categories with similarity
measures above 1.5 (2-0.5) will pass on to the
next module. The list of the most similar categories,
along with the list of recognized keywords and the list
of deduced facts that the similarity measuring module
36 outputs as a data structure would look something like
this:

((KEYWORDS: ("BACKUP" 2) ("DATABASE" 1)
            ("TK-DEVICE" 1) ("AIJDISABLED" 1)
            ("DUMP FILE" 1) ("ANALYZE DUMP" 1))
 (FACTS: (DEVICE-TYPE: TAPE) (LAYERED-PROD: RDB)
         (ERROR-MESSAGE: AIJDISABLED))
 (CATEGORIES: (BACKUP 2) (RDB 2) (BUGCHECK 2))).
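
The similarity scores and threshold in this example can be reproduced with a short Python sketch of the weighting formulae, the sparse inner product, and the dynamic threshold described earlier. The function names and data layout are assumptions for illustration; the natural logarithm is assumed because the text does not state a base, and categories that exactly equal the threshold are kept, which the text leaves open.

```python
import math
from collections import defaultdict
from typing import Dict


def profile_weight(cat_total: int, cf: int) -> float:
    """PW = log(CAT/CF); natural log is an assumption of this sketch."""
    return math.log(cat_total / cf)


def keyword_weights(tf: Dict[str, int], cf: Dict[str, int], cat_total: int) -> Dict[str, float]:
    """KW = (TF * log(CAT/CF)) / CKW, with CKW the root of the summed squares."""
    raw = {k: f * math.log(cat_total / cf[k]) for k, f in tf.items()}
    ckw = math.sqrt(sum(v * v for v in raw.values()))
    return {k: (v / ckw if ckw else 0.0) for k, v in raw.items()}


def similarity_scores(
    kw_weights: Dict[str, float],
    keyword_to_categories: Dict[str, Dict[str, float]],
) -> Dict[str, float]:
    """Inner product restricted to keywords actually found in the input text."""
    scores: Dict[str, float] = defaultdict(float)
    for keyword, weight in kw_weights.items():
        for category, pw in keyword_to_categories.get(keyword, {}).items():
            scores[category] += weight * pw
    return dict(scores)


def apply_threshold(scores: Dict[str, float], offset: float) -> Dict[str, float]:
    """Keep categories scoring at least (highest score - offset)."""
    if not scores:
        return {}
    threshold = max(scores.values()) - offset
    return {c: s for c, s in scores.items() if s >= threshold}


# Worked example: term weights left as raw frequencies, profile weights of 1.
EXAMPLE_WEIGHTS = {
    "BACKUP": 2.0, "DATABASE": 1.0, "TK-DEVICE": 1.0,
    "AIJDISABLED": 1.0, "DUMP FILE": 1.0, "ANALYZE DUMP": 1.0,
}
EXAMPLE_INDEX = {
    "BACKUP": {"BACKUP": 1.0},
    "DATABASE": {"RDB": 1.0, "DBMS": 1.0},
    "AIJDISABLED": {"RDB": 1.0},
    "TK-DEVICE": {"TAPE": 1.0},
    "DUMP FILE": {"BUGCHECK": 1.0},
    "ANALYZE DUMP": {"BUGCHECK": 1.0},
}
EXAMPLE_SCORES = similarity_scores(EXAMPLE_WEIGHTS, EXAMPLE_INDEX)
EXAMPLE_SELECTED = apply_threshold(EXAMPLE_SCORES, 0.5)
```

Because the loop visits only the keywords found in the input text and the categories each one supports, the cost depends on the number of matches rather than on the size of the lexicon.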



To continue the example, a rule base is selected. 
There are two rules in the category selection rule base 
58 as follows: 



IF   (KEYWORD = "BACKUP") and
     (LAYERED-PROD = VIA-PRODUCTS)
THEN (PREFER VIA-PRODUCTS OVER BACKUP)

IF   (SKILL = BUGCHECK) and
     (LAYERED-PROD = VIA-PRODUCTS) and
     (NOT (EXISTS BUGCHECK TYPE))
THEN (PREFER VIA-PRODUCTS OVER BUGCHECK)



These two rules use the class VIA-PRODUCTS
which we defined previously as including the category
RDB. Since the fact (LAYERED-PROD = RDB) is
present, both of these rules will fire, the result being that
the RDB category is preferred over both BACKUP and
BUGCHECK. The final data structure output by the
category disambiguation module 38 is as follows:



((KEYWORDS: ("BACKUP" 2) ("DATABASE" 1)
            ("TK-DEVICE" 1) ("AIJDISABLED" 1)
            ("DUMP FILE" 1) ("ANALYZE DUMP" 1))
 (FACTS: (DEVICE-TYPE: TAPE) (LAYERED-PROD: RDB)
         (ERROR-MESSAGE: AIJDISABLED))
 (CATEGORIES: (RDB 2))
 (PREFERENCES: (RDB OVER BACKUP RULE-1)
               (RDB OVER BUGCHECK RULE-2))).






The rule numbers are listed with the preferences so 
they can be accessed at run time to generate explana- 
tions to the user as to why the rules fired. 
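
The effect of the two rules can be imitated in plain Python. This is a hedged simplification, not CLIPS and not the patented rule compiler: the rule encoding in `RULES`, the `BUGCHECK-TYPE` fact name used for the "NOT EXISTS" test, and the functions `fire_rules` and `resolve_preferences` are all assumptions made for this sketch.

```python
from typing import Dict, List, Set, Tuple

Preference = Tuple[str, str, str]  # (preferred category, other category, rule id)

# Category class used by both rules in the example.
CATEGORY_CLASSES: Dict[str, Set[str]] = {"VIA-PRODUCTS": {"RDB"}}

# Hypothetical encoding of the two example rules.
RULES: List[dict] = [
    {"id": "RULE-1", "keyword": "BACKUP",
     "layered_prod_class": "VIA-PRODUCTS", "over": "BACKUP"},
    {"id": "RULE-2", "category": "BUGCHECK", "absent_fact": "BUGCHECK-TYPE",
     "layered_prod_class": "VIA-PRODUCTS", "over": "BUGCHECK"},
]


def fire_rules(keywords: Set[str], facts: Dict[str, str], categories: List[str]) -> List[Preference]:
    """Fire every rule whose conditions hold and record its preference."""
    preferences: List[Preference] = []
    for rule in RULES:
        if "keyword" in rule and rule["keyword"] not in keywords:
            continue
        if "category" in rule and rule["category"] not in categories:
            continue
        if rule.get("absent_fact") in facts:  # the fact must NOT exist
            continue
        product = facts.get("LAYERED-PROD")
        # The class name in the rule stands for any category that is a member.
        if product not in CATEGORY_CLASSES[rule["layered_prod_class"]]:
            continue
        if product in categories:
            preferences.append((product, rule["over"], rule["id"]))
    return preferences


def resolve_preferences(categories: List[str], preferences: List[Preference]) -> List[str]:
    """Keep only categories that no other category is preferred over."""
    dominated = {other for preferred, other, _ in preferences if preferred != other}
    return [c for c in categories if c not in dominated]
```

Applied to the keywords, facts and categories of the example, both rules fire and only RDB remains; each recorded preference keeps its rule identifier so that an explanation can be generated later.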

Once a text classification operation is performed,
control is returned to the invoking application, in this
case a call handling system. The call handling system
will then use the classification to route the customer
service request to the appropriate support group. The
call handling system can also store the service request
and its classification for later use by the relevance
feedback learning module 40 (FIG. 2). After a given
predetermined period of time, the call handling system
collects all the service requests and their classifications and
passes them as input to the relevance feedback learning
module 40. The relevance feedback learning module 40
takes these classified requests and interacts with a
human call routing expert via video display terminal 18
(FIG. 1) to determine which ones were correctly and
which ones were incorrectly routed. This learning module
would then take the information from the call routing
expert and adjust the profile weights in the
keyword/category profiles 56 as previously described.
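
The profile adjustment performed by the relevance feedback learning module can be sketched as follows. The record layout (`chosen`, `correct`, `keyword_weights`) and the function names are hypothetical, and the sketch applies no clamping or scaling to the updated weights because the description does not specify any.

```python
from typing import Dict, List

Profiles = Dict[str, Dict[str, float]]


def adjust(profiles: Profiles, category: str, weights: Dict[str, float], sign: float) -> None:
    """Add (sign=+1) or subtract (sign=-1) keyword weights in one category profile."""
    profile = profiles.setdefault(category, {})
    for keyword, weight in weights.items():
        profile[keyword] = profile.get(keyword, 0.0) + sign * weight


def apply_feedback(profiles: Profiles, classifications: List[dict]) -> Profiles:
    """Update profile weights from expert-reviewed classifications."""
    for item in classifications:
        chosen, correct = item["chosen"], item["correct"]
        weights = item["keyword_weights"]
        if chosen == correct:
            # Correctly classified: reinforce the chosen category.
            adjust(profiles, chosen, weights, +1.0)
        else:
            # Incorrectly classified: weaken the chosen category and
            # reinforce the category the expert identified as correct.
            adjust(profiles, chosen, weights, -1.0)
            adjust(profiles, correct, weights, +1.0)
    return profiles
```

For instance, a request routed to BACKUP that the expert assigns to RDB lowers the matching weights in the BACKUP profile and raises them in the RDB profile.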

What is claimed is: 

1. A method for classifying natural language text 
input into a computer system, the system includes mem- 
ory having a domain specific knowledge base having a 
plurality of categories stored therein, the method com- 
prising the steps of: 

(a) accepting as input natural language input text; 

(b) parsing the natural language input text into a first 
list of recognized keywords; 



(c) using the first list to deduce further facts from the 
natural language input text; 

(d) compiling the deduced facts into a second list; 

(e) calculating a numeric similarity score for each one 
of the plurality of categories in the knowledge base 
to indicate how similar one of the plurality of cate- 
gories is to the natural language input text; 

(f) applying a dynamic threshold to determine which 
ones of the plurality of categories are most similar 
to the recognized keywords of the first list, com- 
prising the sub-steps of: 

(I) calculating a value for the dynamic threshold 
based upon a similarity score of a most similar 
category and a predefined threshold offset, and 

(II) classifying the categories based upon their 
respective similarity scores by discarding cate- 
gories whose similarity scores are below the 
threshold value; 

(g) compiling the ones of the plurality of categories 
determined to be most similar in step (f) into a third 
list; and 

(i) passing the first list, the second list and the third 
list to an external application. 

2. The method according to claim 1 wherein the 
keywords comprise words, phrases and regular expres- 
sions. 

3. The method according to claim 1 wherein the 
knowledge base includes a keyword class hierarchy 
structured such that keywords that share something in
common are grouped into classes, each class has
associated facts that are true when a member of the class is
identified in the natural language input text, wherein the 
steps of using the first list to deduce further facts from 
the natural language input text and compiling the de- 
duced facts into a second list further are performed by 
the steps of: 

(a) searching the keyword class hierarchy to deter- 
mine if a keyword identified in the first list is a 
member of a class in the keyword class hierarchy; 

(b) when a keyword identified in the first list is a 
member of a class, 

(i) inferring all the facts attached to that class by 
adding them to the second list, and 

(ii) adding all the facts attached to all classes above 
the classes of which the identified keyword is a 
member in the keyword class hierarchy to the 
second list; and

(c) repeating steps (a) through (b) for each keyword 
in the first list. 

4. The method according to claim 2 wherein the 
knowledge base includes a keyword class hierarchy 
structured such that keywords that share something in 
common are grouped into classes, each class has associ- 
ated facts that are true when a member of the class is 
identified in the natural language input text, wherein the 
step of using the first list to deduce further facts from 
the natural language input text further comprises the 
step of substituting general descriptions of an identified 
keyword in the first list in an attempt to match other 
phrases that could not be matched explicitly so that a 
group of similar keywords can be grouped into a class 
and a word can be attached to the class to be used as a 
substitute for matching phrases. 

5. The method according to claim 1 wherein the 
knowledge base includes a keyword class hierarchy 
structured such that keywords that share something in 
common are grouped into classes, each class has associ- 
ated facts that are true when a member of the class is 
identified in the natural language input text, wherein the
steps of using the first list to deduce further facts from
the natural language input text and compiling the
deduced facts into a second list further are performed by
the steps of:

(a) searching the keyword class hierarchy for all
classes of which an identified keyword in the first
list is a member;

(b) adding all facts associated with each one of the
classes of which the identified keyword is a member
to a global list of deduced facts;

(c) recursively applying step (b) on all classes above
the classes of which the identified keyword is a
member in the keyword class hierarchy; and

(d) repeating steps (a) through (c) for each keyword
in the first list.

6. The method according to claim 1 wherein the
knowledge base includes a lexicon that includes words,
phrases and expressions, and a keyword class hierarchy
structured such that keywords that share something in
common are grouped into classes, each class has
associated facts that are true when a member of the class is
identified in the natural language input text, wherein the
step of using the first list to deduce further facts from
the natural language input text further comprises the
steps of:

(a) searching the keyword class hierarchy for all
classes of which an identified keyword in the first
list is a member;

(b) locating all substitution keywords associated with
each class of which the identified keyword is a
member;

(c) retrieving the located substitution keywords;

(d) substituting the located substitution keywords for
the identified keyword;

(e) using the located substitution keywords to identify
matches between the located substitution keywords
and phrases in the lexicon;

(f) recursively applying steps (b) through (e) on all
classes above the classes of which the identified
keyword is a member in the keyword class hierarchy;
and

(g) repeating steps (a) through (f) for each keyword in
the first list.

7. A text classification system comprising:
memory;

a domain specific knowledge base stored in said memory
having a plurality of categories, the domain
specific knowledge base includes a knowledge base
of keyword/category profiles, each category in the
keyword/category profiles knowledge base having
an associated profile which indicates what information
provides evidence for a given category, the
keyword/profile weight knowledge base arranged
to have associated with each keyword in a profile a
profile weight that represents the amount of evidence
a keyword provides for a given category;
and

a computer coupled to the memory, the computer
including:

a natural language module for accepting as input
into the computer natural language input text,
the natural language module includes means for
parsing the natural language input text into a first
list of recognized keywords;

an intelligent inferencer module for using the first
list to deduce further facts from the information
explicitly stated in the natural language input
text, the intelligent inferencer module includes
means for compiling the deduced facts into a
second list;

a similarity measuring module for calculating a
numeric similarity score for each one of the plurality
of categories in the knowledge base to
indicate how similar one of the plurality of categories
is to the natural language input text, the
similarity measuring module includes:
means for applying a dynamic threshold to determine
which ones of the plurality of categories
are most similar to the recognized keywords
of the natural language input text, and
means for compiling the ones of the plurality of
categories determined to be most similar into a
third list; and

a relevance feedback learning module for adjusting
the profile weights in the keyword/category
profiles in the domain specific knowledge base
based upon the ones of the plurality of categories
determined most relevant to the natural language
input text by the similarity measuring module
and a second ones of the plurality of categories
determined most relevant to the natural language
input text by an external source.

8. A method for classifying natural language text
input into a computer system, the system includes memory
having a domain specific knowledge base having a
plurality of categories stored therein, the method
comprising the steps of:

(a) accepting as input natural language input text;

(b) parsing the natural language input text into a first 
list of recognized keywords; 

(c) using the first list to deduce further facts from the 
natural language input text; 

(d) compiling the deduced facts into a second list; 

(e) calculating a numeric similarity score for each one 
of the plurality of categories in the knowledge base 
to indicate how similar one of the plurality of cate- 
gories is to the natural language input text; 

(f) applying a dynamic threshold to determine which 
ones of the plurality of categories are most similar 
to the recognized keywords of the first list, the step 
of applying a dynamic threshold further compris- 
ing the sub-steps of: 

(1) calculating a value for the dynamic threshold 
based upon a similarity score of a most similar 
category and a predefined threshold offset, and 

(2) classifying the categories based upon their re- 
spective similarity scores by discarding catego- 
ries whose similarity scores are below the thresh- 
old value; and 

(g) compiling the ones of the plurality of categories 
determined to be most similar in step (f) into a third 
list 
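
As an illustration only, sub-steps (f)(1) and (f)(2) of claim 8 might be sketched as follows. The claim says the threshold is "based upon" the top similarity score and a predefined offset but does not say how they are combined; subtracting the offset from the top score is an assumption here, as are the function name `apply_dynamic_threshold` and the sample scores.

```python
def apply_dynamic_threshold(scores, offset):
    """Return the categories that survive a threshold tied to the best score.

    scores: mapping of category name -> numeric similarity score.
    offset: predefined threshold offset (assumed to be subtracted).
    """
    if not scores:
        return []
    # Sub-step (1): derive the threshold from the most similar category.
    threshold = max(scores.values()) - offset
    # Sub-step (2): discard categories whose score is below the threshold.
    kept = [(cat, s) for cat, s in scores.items() if s >= threshold]
    # Order the surviving categories from most to least similar.
    kept.sort(key=lambda item: item[1], reverse=True)
    return [cat for cat, _ in kept]


# Hypothetical scores for three categories.
scores = {"printing": 0.9, "networking": 0.75, "billing": 0.2}
third_list = apply_dynamic_threshold(scores, offset=0.25)
print(third_list)  # ['printing', 'networking']
```

Because the threshold moves with the best score, a weak input still yields at least one category, which a fixed cutoff would not guarantee.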

9. The method according to claim 1 wherein the 
domain specific knowledge base further includes a rule 
base, the method further comprising the steps of: 

(a) utilizing the rule base to select certain ones of the 
plurality of categories determined to be most simi- 
lar to the recognized keywords over other ones of 
the plurality of categories based on the first and 
second lists; and 

(b) modifying the third list of the most similar catego- 
ries to include the certain ones of the plurality of 
categories selected. 

10. The method according to claim 1 wherein the 
domain specific knowledge base includes a knowledge 
base of keyword/category profiles, each category in the
keyword/category profiles knowledge base having an
associated profile which indicates what information
provides evidence for a given category, the keyword/profile
weight knowledge base is arranged to have associated
with each keyword in a profile a profile weight
that represents the amount of evidence a keyword pro- 
vides for a given category, the method further compris- 
ing the step of adjusting the profile weights in the key- 
word/category profiles in the domain specific knowl- 
edge base based upon the ones of the plurality of catego- 
ries determined most relevant to the natural language 
input text and a second ones of the plurality of catego- 
ries determined most relevant to the natural language 
input text by an external source. 
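
The weight adjustment of claim 10 could look roughly like the sketch below. The claim does not give an update rule, so the additive rule used here (raise weights for categories the external source confirmed, lower them for categories the system chose but the external source did not) is purely an assumption, as are the names `adjust_profile_weights` and `rate`.

```python
def adjust_profile_weights(profiles, keywords, predicted, confirmed, rate=0.1):
    """Nudge keyword weights toward the externally confirmed categories.

    profiles:  {category: {keyword: weight}}, modified in place.
    keywords:  recognized keywords from the input text (the first list).
    predicted: categories the system judged most relevant.
    confirmed: categories an external source judged most relevant.
    """
    for category in set(predicted) | set(confirmed):
        profile = profiles.setdefault(category, {})
        # Reinforce confirmed categories; penalize unconfirmed predictions.
        delta = rate if category in confirmed else -rate
        for kw in keywords:
            # Weights are kept non-negative (an assumption of this sketch).
            profile[kw] = max(0.0, profile.get(kw, 0.0) + delta)
    return profiles


# Hypothetical profiles: the system picked "billing", a human said "printing".
profiles = {"printing": {"toner": 0.5}, "billing": {"toner": 0.25}}
adjust_profile_weights(profiles, ["toner"],
                       predicted=["billing"], confirmed=["printing"], rate=0.25)
print(profiles)  # {'printing': {'toner': 0.75}, 'billing': {'toner': 0.0}}
```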

11. A method for routing customer service requests
by a computer system in a customer support center
which includes support groups to service customer
requests, the computer system including a call handling
system, a text classification system and memory having
a domain specific knowledge base having a plurality of
categories stored therein representative of the support
groups within the customer support center, each support
group being identified by a name, the method comprising
the steps of:

(a) receiving a customer service request by the computer
system from the call handling system;

(b) passing the customer service request to the text
classification system to determine where to route
the customer service request within the customer
support center;

(c) parsing the customer service request into a first list
of recognized keywords;

(d) using the first list to deduce further facts from the
customer service request;

(e) compiling the deduced facts into a second list;

(f) calculating a numeric similarity score for each one
of the plurality of categories in the knowledge base
to indicate how similar each one of the plurality of
categories is to the customer service request;

(g) applying a dynamic threshold to identify which
one of the support groups should handle the customer
service request by determining which ones
of the plurality of categories are most similar to the
recognized keywords of the customer service request;

(h) compiling the ones of the plurality of categories
determined to be most similar in step (g) into a third
list;

(i) passing the first list, the second list and the third
list back to the call handling system; and

(j) routing the customer service request to the identified
one of the support groups.

12. A method for routing customer service requests
by a computer system in a customer support center
which includes support groups to service customer
requests, the computer system including a call handling
system, a text classification system and memory having
a domain specific knowledge base having a plurality of
categories stored therein representative of the support
groups within the customer support center, each support
group being identified by a name, and a rule base,
the method comprising the steps of:

(a) receiving a customer service request by the computer
system from the call handling system;

(b) passing the customer service request to the text
classification system to determine where to route
the customer service request within the customer
support center;

(c) parsing the customer service request into a first list
of recognized keywords;

(d) using the first list to deduce further facts from the
customer service request;

(e) compiling the deduced facts into a second list;

(f) calculating, utilizing the first list, a numeric similarity
score for each one of the plurality of categories
in the knowledge base to indicate how similar
each one of the plurality of categories is to the
customer service request;

(g) applying a dynamic threshold to identify which
support groups should handle the customer service
request by determining which ones of the plurality
of categories are most similar to the recognized
keywords of the customer service request;

(h) compiling the ones of the plurality of categories
determined to be most similar in step (g) into a third
list;

(i) utilizing the rule base to select certain ones of the
plurality of categories determined to be most similar
to the recognized keywords over other ones of
the plurality of categories based on the first and
second lists;

(j) modifying the third list of the most similar categories
to include the certain ones of the plurality of
categories selected;

(k) passing the first list, the second list and the third
list back to the call handling system; and

(l) routing the customer service request to the selected
one of the support groups.

13. The method according to claim 11 or 12 wherein
the domain specific knowledge base includes a knowledge
base of keyword/category profiles, each category
in the keyword/category profiles knowledge base having
an associated profile which indicates what information
provides evidence for a given category, the keyword/profile
weight knowledge base is arranged to
have associated with each keyword in a profile a profile
weight that represents the amount of evidence a keyword
provides for a given category, the method further
comprising the step of adjusting the profile weights in
the keyword/category profiles in the domain specific
knowledge base based upon the one of the support
groups selected to handle the customer service request
and a second one of the support groups determined
most relevant to the natural language input text by an
external source.
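
The routing steps of claims 11 and 12 (parse, deduce, score, threshold, route) can be pictured with the minimal sketch below. It is not the claimed implementation: the similarity score is assumed to be a plain sum of profile weights, the threshold is assumed to be the top score minus an offset, and every name (`classify_request`, `route`, `class_facts`, the sample support groups) is hypothetical.

```python
import re


def classify_request(text, profiles, class_facts, offset):
    """Return (first_list, second_list, third_list) for one service request.

    profiles:    {support_group: {keyword: profile_weight}}
    class_facts: {keyword: [facts implied by that keyword]}
    """
    # Parse the request into a first list of recognized keywords.
    known = {kw for profile in profiles.values() for kw in profile}
    first_list = [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w in known]
    # Deduce further facts and compile them into a second list.
    second_list = sorted({f for kw in first_list for f in class_facts.get(kw, [])})
    # Score each category (assumed: sum of matching profile weights).
    scores = {g: sum(p.get(kw, 0.0) for kw in first_list) for g, p in profiles.items()}
    if not scores:
        return first_list, second_list, []
    # Apply the dynamic threshold and compile the third list, best first.
    threshold = max(scores.values()) - offset
    third_list = sorted((g for g in scores if scores[g] >= threshold),
                        key=lambda g: -scores[g])
    return first_list, second_list, third_list


def route(third_list, default_group="general"):
    """Pick the support group to receive the request (fallback is assumed)."""
    return third_list[0] if third_list else default_group


profiles = {
    "hardware": {"printer": 2.0, "jammed": 1.0},
    "network": {"router": 2.0, "printer": 0.5},
}
class_facts = {"printer": ["device:peripheral"]}
lists = classify_request("My printer is jammed again.", profiles, class_facts, offset=1.0)
print(lists)            # (['printer', 'jammed'], ['device:peripheral'], ['hardware'])
print(route(lists[2]))  # hardware
```

In the claims the three lists are passed back to the call handling system, which performs the routing; the separate `route` function mirrors that split.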
14. A text classification system comprising: 
a memory; 

a domain specific knowledge base stored in said mem- 
ory having a plurality of categories wherein the 
domain specific knowledge base includes a knowl- 
edge base of keyword/category profiles, each cate- 
gory in the keyword/category profiles knowledge 
base having an associated profile which indicates 
what information provides evidence for a given 
category, the keyword/profile knowledge base is 
arranged to have associated with each keyword in 
a profile a profile weight that represents the 
amount of evidence a keyword provides for a given 
category; and 

a computer coupled to the memory, the computer 
including: 

means for accepting as input into the computer, 
natural language input text, 
means for parsing the natural language input text
into a first list of recognized keywords, 

means for using the first list to deduce further facts 
from the natural language input text, 

means for compiling the deduced facts into a second
list,

means for calculating a numeric similarity score for 
each one of the plurality of categories in the 
knowledge base to indicate how similar one of 
the plurality of categories is to the natural language
input text,

means for applying a dynamic threshold to deter- 
mine which ones of the plurality of categories 
are most similar to the recognized keywords of 
the first list,

means for adjusting the profile weights in the keyword/category
profiles in the domain specific
knowledge base based upon the ones of the plurality
of categories determined to be the most relevant
to the natural language input text and a
second ones of the plurality of categories deter- 
mined most relevant to the natural language 
input text by an external source, 

means for compiling the ones of the plurality of 
categories determined to be most similar into a 
third list, and 

means for passing the first list, the second list and 
the third list to an external application. 

15. The text classification system according to claim 
14 wherein the keywords comprise words, phrases and
regular expressions.
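
For illustration, recognizing keywords that are words, phrases or regular expressions (claim 15) might be sketched as below. The matching order, the use of named patterns, and the function name `recognize_keywords` are assumptions of this sketch and are not taken from the claims.

```python
import re


def recognize_keywords(text, words, phrases, patterns):
    """Build a first list of recognized keywords from free text.

    words:    set of single-word keywords.
    phrases:  list of multi-word keywords, matched as substrings.
    patterns: {keyword_name: regular expression} applied to the raw text.
    """
    lowered = text.lower()
    first_list = []
    # Phrases: simple case-insensitive substring match.
    for phrase in phrases:
        if phrase in lowered:
            first_list.append(phrase)
    # Words: match whole tokens, in order of appearance, without repeats.
    for token in re.findall(r"[a-z0-9]+", lowered):
        if token in words and token not in first_list:
            first_list.append(token)
    # Regular expressions: record the pattern's name when it matches.
    for name, pattern in patterns.items():
        if re.search(pattern, text):
            first_list.append(name)
    return first_list


found = recognize_keywords(
    "Error code E1234: laser printer will not print",
    words={"print", "error"},
    phrases=["laser printer"],
    patterns={"error_code": r"\bE\d{4}\b"},
)
print(found)  # ['laser printer', 'error', 'print', 'error_code']
```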

16. The text classification system according to claim 
14 wherein the domain specific knowledge base further 
includes a rule base and the computer further com- 
prises: 

means for utilizing the rule base to select certain ones
of the plurality of categories that were determined 
to be most similar to the recognized keywords over 
other ones of the plurality of categories based on 
the first and second lists; and 

means for modifying the third list of the most similar
categories to include the certain ones of the plural- 
ity of categories selected. 

17. The text classification system according to claim 
14 wherein the domain specific knowledge base includes
a knowledge base of keyword/category profiles,
each category in the keyword/category profiles knowledge
base having an associated profile which indicates
what information provides evidence for a given category,
the keyword/profile weight knowledge base is
arranged to have associated with each keyword in a
profile a profile weight that represents the amount of
evidence a keyword provides for a given category,
wherein the computer further comprises means for
adjusting the profile weights in the keyword/category
profiles in the domain specific knowledge base based
upon the ones of the plurality of categories determined
most relevant to the natural language input text and a
second ones of the plurality of categories determined
most relevant to the natural language input text by an
external source.

18. A method for classifying natural language text
input into a computer system, the system includes memory
having a domain specific knowledge base having a
plurality of categories stored therein and including a
rule base, the method comprising the steps of:

(a) accepting as input natural language input text; 

(b) parsing the natural language input text into a first 
list of recognized keywords; 



(c) using the first list to deduce further facts from the 
natural language input text; 

(d) compiling the deduced facts into a second list;

(e) calculating a numeric similarity score for each one 
of the plurality of categories in the knowledge base 
to indicate how similar one of the plurality of cate- 
gories is to the natural language input text; 

(f) applying a dynamic threshold to determine which 
ones of the plurality of categories are most similar 
to the recognized keywords of the first list; 

(g) compiling the ones of the plurality of categories
determined to be most similar in step (f) into a third
list;

(h) utilizing the rule base to select certain ones of the 
plurality of categories determined to be most simi- 
lar to the recognized keywords over other ones of 
the plurality of categories based on the first and 
second lists; and 

(i) modifying the third list of the most similar catego- 
ries to include the certain ones of the plurality of 
categories selected. 
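
Steps (h) and (i) of claim 18 leave the form of the rule base open. The sketch below assumes one possible rule shape, "if this evidence appears in the first or second list, prefer category A over category B", and assumes that the losing category is dropped from the third list. The `Rule` type, its fields and `disambiguate` are hypothetical.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    requires: frozenset  # keywords or facts that must all be present
    prefer: str          # category selected when the rule fires
    over: str            # category it is selected over


def disambiguate(third_list, first_list, second_list, rules):
    """Return a modified copy of the third list after applying the rules."""
    evidence = set(first_list) | set(second_list)
    result = list(third_list)
    for rule in rules:
        # A rule fires only if its evidence is present and the rival is listed.
        if rule.requires <= evidence and rule.over in result:
            idx = result.index(rule.over)
            if rule.prefer in result:
                result.pop(idx)            # preferred already listed: drop rival
            else:
                result[idx] = rule.prefer  # otherwise take the rival's place
    return result


rules = [Rule(frozenset({"device:peripheral"}), prefer="hardware", over="network")]
resolved = disambiguate(["network", "hardware"],
                        first_list=["printer", "offline"],
                        second_list=["device:peripheral"],
                        rules=rules)
print(resolved)  # ['hardware']
```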

19. The text classification system according to claim 
14 wherein the means for applying a dynamic threshold 
further includes: 

means for calculating a value for the dynamic thresh- 
old based upon a similarity score of a most similar 
category and a predefined threshold offset; and 

means for classifying the categories based upon their 
respective similarity scores by discarding catego- 
ries whose similarity scores are below the thresh- 
old value. 

20. A method for classifying natural language text 
input into a computer system, the system includes mem- 
ory having a domain specific knowledge base having a 
plurality of categories stored therein, the knowledge 
base including a lexicon that includes words, phrases 
and expressions and a keyword class hierarchy structured
such that keywords that share something in common
are grouped into classes, each class has associated
facts that are true when a member of the class is identified
in the natural language input text, the method
comprising the steps of: 

(a) accepting as input natural language input text; 

(b) parsing the natural language input text into a first 
list of recognized keywords; 

(c) using the first list to deduce further facts from the 
natural language input text comprising the sub- 
steps of: 

(1) searching the keyword class hierarchy for all 
classes of which an identified keyword in the 
first list is a member, 

(2) locating all substitution keywords associated 
with each class of which the identified keyword 
is a member, 

(3) retrieving the located substitution keywords, 

(4) substituting the located substitution keywords 
for the identified keyword, 

(5) using the located substitution keywords to iden- 
tify matches between the located substitution 
keywords and phrases in the lexicon, 

(6) recursively applying sub-steps (2) through (5) 
on all classes above the classes of which the 
identified keyword is a member in the keyword 
class hierarchy, and 

(7) repeating sub-steps (1) through (6) for each 
keyword in the first list; 

(d) compiling the deduced facts into a second list; 
(e) calculating a numeric similarity score for each one
of the plurality of categories in the knowledge base 
to indicate how similar one of the plurality of cate- 
gories is to the natural language input text; 

(f) applying a dynamic threshold to determine which 
ones of the plurality of categories are most similar 
to the recognized keywords of the first list; and 

(g) compiling the ones of the plurality of categories 
determined to be most similar in step (f) into a third 
list. 
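
Sub-steps (c)(1) through (c)(7) of claim 20 might be sketched as follows. The data layout is an assumption: each class holds its member keywords, its associated facts, its substitution keywords and a link to its parent class, and a lexicon phrase is matched by swapping the substitution keyword in and testing the adjacent two-word phrases. The climb up the hierarchy is written as a loop rather than a recursion. All names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class KeywordClass:
    name: str
    members: set = field(default_factory=set)
    facts: list = field(default_factory=list)
    substitutes: list = field(default_factory=list)
    parent: Optional["KeywordClass"] = None


def deduce_facts(first_list, classes, lexicon):
    """Compile facts and lexicon phrases implied by the first list."""
    second_list = []

    def add(item):
        if item not in second_list:
            second_list.append(item)

    for i, keyword in enumerate(first_list):          # sub-step (7)
        for cls in classes:                           # sub-step (1)
            if keyword not in cls.members:
                continue
            node = cls
            while node is not None:                   # sub-step (6)
                for fact in node.facts:
                    add(fact)
                for sub in node.substitutes:          # sub-steps (2) to (4)
                    candidate = first_list[:i] + [sub] + first_list[i + 1:]
                    for j in (i - 1, i):              # sub-step (5)
                        if 0 <= j < len(candidate) - 1:
                            phrase = " ".join(candidate[j:j + 2])
                            if phrase in lexicon:
                                add(phrase)
                node = node.parent
    return second_list


device = KeywordClass("device", facts=["is_hardware"], substitutes=["device"])
printers = KeywordClass("printer_models", members={"laserjet"},
                        facts=["is_printer"], substitutes=["printer"], parent=device)
deduced = deduce_facts(["laserjet", "jam"], [printers, device], {"printer jam"})
print(deduced)  # ['is_printer', 'printer jam', 'is_hardware']
```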

21. A text classification system comprising: 
memory; 

a domain specific knowledge base stored in said mem- 
ory having a plurality of categories, the domain 
specific knowledge base including a rule base; and 

a computer coupled to the memory, the computer 
including: 

a natural language module for accepting as input 
into the computer natural language input text, 
the natural language module includes means for
parsing the natural language input text into a first
list of recognized keywords;

an intelligent inferencer module for using the first
list to deduce further facts from the information
explicitly stated in the natural language input
text, the intelligent inferencer module includes 
means for compiling the deduced facts into a 
second list; 

a similarity measuring module for calculating a 
numeric similarity score for each one of the plu- 
rality of categories in the knowledge base to 
indicate how similar one of the plurality of cate- 
gories is to the natural language input text, the 
similarity measuring module includes: 
means for applying a dynamic threshold to deter- 
mine which ones of the plurality of categories 
are most similar to the recognized keywords 
of the natural language input text, and 
means for compiling the ones of the plurality of 
categories determined to be most similar into a
third list; and
a category disambiguation module for utilizing the
rule base to select certain ones of the plurality of
categories determined to be most similar to the
recognized keywords over other ones of the
plurality of categories based on the first and 
second lists, the category disambiguation module 
includes means for modifying the third list of the 
most similar categories to include the certain 
ones of the plurality of categories selected. 

22. A text classification system comprising: 
a memory; 

a domain specific knowledge base stored in said mem- 
ory having a rule base and a plurality of categories; 
and 

a computer coupled to the memory, the computer 
including: 

means for accepting as input into the computer, 
natural language input text, 

means for parsing the natural language input text
into a first list of recognized keywords, 

means for using the first list to deduce further facts 
from the natural language input text, 

means for compiling the deduced facts into a sec- 
ond list, 

means for calculating a numeric similarity score for 
each one of the plurality of categories in the 
knowledge base to indicate how similar one of 
the plurality of categories is to the natural lan-
guage input text, 
means for applying a dynamic threshold to deter- 
mine which ones of the plurality of categories 
are most similar to the recognized keywords of 
the first list, 

means for compiling the ones of the plurality of 
categories determined to be most similar into a 
third list, 

means for utilizing the rule base to select certain 
ones of the plurality of categories that were de- 
termined to be most similar to the recognized 
keywords over other ones of the plurality of 
categories based on the first and second lists, and 

means for modifying the third list of the most simi- 
lar categories to include the certain ones of the 
plurality of categories selected. 

23. A text classification system comprising: 
a memory; 

a domain specific knowledge base stored in said mem- 
ory having a plurality of categories; and 

a computer coupled to the memory, the computer 
including: 

means for accepting as input into the computer, 
natural language input text, 

means for parsing the natural language input text 
into a first list of recognized keywords, 

means for using the first list to deduce further facts 
from the natural language input text, 

means for compiling the deduced facts into a sec- 
ond list, 

means for calculating a numeric similarity score for 
each one of the plurality of categories in the 
knowledge base to indicate how similar one of 
the plurality of categories is to the natural lan- 
guage input text, 

means for applying a dynamic threshold to deter- 
mine which ones of the plurality of categories 
are most similar to the recognized keywords of 
the first list, 

means for calculating a value for the dynamic 
threshold based upon a similarity score of a most 
similar category and a predefined threshold off- 
set, 

means for classifying the categories based upon 
their respective similarity scores by discarding 
categories whose similarity scores are below the 
threshold value, and 

means for compiling the ones of the plurality of 
categories determined to be most similar into a 
third list. 

24. A text classification system comprising: 
a memory; 

a domain specific knowledge base stored in said mem- 
ory having a plurality of categories, the domain 
specific knowledge base including a knowledge 
base of keyword/category profiles, each category 
in the keyword/category profiles knowledge base 
having an associated profile which indicates what 
information provides evidence for a given cate- 
gory, the keyword/profile weight knowledge base 
is arranged to have associated with each keyword 
in a profile a profile weight that represents the 
amount of evidence a keyword provides for a given 
category; and 

a computer coupled to the memory, the computer 
including: 
means for accepting as input into the computer,
natural language input text, 

means for parsing the natural language input text 
into a first list of recognized keywords, 

means for using the first list to deduce further facts
from the natural language input text,

means for compiling the deduced facts into a second
list,

means for calculating a numeric similarity score for
each one of the plurality of categories in the
knowledge base to indicate how similar one of 
the plurality of categories is to the natural lan- 
guage input text, 

means for applying a dynamic threshold to deter- 
mine which ones of the plurality of categories
are most similar to the recognized keywords of 
the first list, 

means for compiling the ones of the plurality of 
categories determined to be most similar into a 
third list, and 

means for adjusting the profile weights in the key- 
word/category profiles in the domain specific 
knowledge base based upon the ones of the plu- 
rality of categories determined most relevant to 
the natural language input text and a second ones 
of the plurality of categories determined most 
relevant to the natural language input text by an
external source.